CN112687332B - Method, apparatus and storage medium for determining disease-risk variant sites

Method, apparatus and storage medium for determining disease-risk variant sites

Info

Publication number: CN112687332B
Other versions: CN112687332A
Application number: CN202110268390.0A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: information, phenotype, neural network, determining, variant
Legal status: Active (granted)
Inventors: Zhong Yunshan (钟韵山), Liu Mengmeng (刘蒙蒙), Zhang Yu (张钰), Mu Ting (穆婷), Li Yunshuang (李云双)
Current and original assignee: Beijing Berry Hekang Biotechnology Co., Ltd.
Application filed by Beijing Berry Hekang Biotechnology Co., Ltd.; priority to CN202110268390.0A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a method, computing device, and storage medium for determining disease-risk variant sites. The method comprises: acquiring alignment result information of the whole-exome sequencing sequence of a sample of a subject to be tested and clinical description information about the subject; determining variant sites based on the alignment result information, so as to annotate the variant sites; extracting phenotype keywords from the clinical description information via a first neural network model; ranking candidate genes based on the phenotype keywords; filtering the variant sites based on the annotation information about the variant sites; and generating, based on the annotation information of the variant sites remaining after filtering and the ranking information about the candidate genes, input data for a trained predetermined model, in order to determine disease-risk variant sites for the subject. The method can determine disease-risk variant sites automatically, rapidly, and accurately.

Description

Method, apparatus and storage medium for determining disease-risk variant sites
Technical Field
The present disclosure relates generally to bioinformatics processing, and in particular, to methods, computing devices, and computer storage media for determining risk of disease variant sites.
Background
In recent years, with increasing sequencing throughput and rapidly decreasing cost, high-throughput sequencing technology has developed quickly and is widely applied to the identification of disease-causing sites, chromosome copy number variations, and structural variations. Whole-exome sequencing is a highly representative high-throughput sequencing technology: by capturing the exonic regions, it can detect in a single run variants in the exonic regions of more than 20,000 genes in the human genome, and it has extremely high clinical value. In practice, whole-exome sequencing usually identifies tens of thousands of variant sites; annotating and interpreting the pathogenicity of this vast number of variant sites, and finally correctly finding the one or few sites responsible for disease risk or phenotype, is a challenging task.
Many key steps in conventional approaches to determining disease-risk variant sites require manual intervention, e.g., manually interpreting phenotype keywords in electronic medical records for matching against the Human Phenotype Ontology (HPO); automated pipeline solutions for determining disease-risk variant sites are therefore lacking. The manual steps also significantly reduce interpretation efficiency: different people bring subjective bias to the same clinical description, the clinical phenotype description format recorded in electronic medical records is not fixed, and the information is disorganized and redundant. As a result, conventional solutions for determining disease-risk variant sites are inefficient, time-consuming, and error-prone.
In conclusion, conventional schemes for determining disease-risk variant sites cannot be automated, are inefficient, and are prone to error.
Disclosure of Invention
The present disclosure provides a method, computing device, and computer storage medium for determining disease-risk variant sites, which enable disease-risk variant sites to be determined automatically, rapidly, and accurately.
According to a first aspect of the present disclosure, a method for determining disease-risk variant sites is provided. The method comprises: acquiring alignment result information of the whole-exome sequencing sequence of a sample of a subject to be tested and clinical description information about the subject; determining variant sites based on the whole-exome sequencing sequence alignment result information, so as to annotate the variant sites to generate annotation information about the variant sites; extracting, via a trained first neural network model, phenotype keywords from the clinical description information; ranking candidate genes based on the extracted phenotype keywords so as to generate ranking information about the candidate genes, the candidate genes being associated with the clinical description information; filtering the variant sites based on the annotation information about the variant sites; generating input data based on the annotation information of the variant sites remaining after filtering and the ranking information about the candidate genes; and extracting features of the input data based on a trained predetermined model so as to determine disease-risk variant sites for the subject.
According to a second aspect of the present disclosure, there is also provided a computing device comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the method of the first aspect of the present disclosure.
According to a third aspect of the present disclosure, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a machine, implements a method according to the first aspect of the present disclosure.
In some embodiments, the first neural network model comprises a first network layer constructed from a first language model, a second network layer constructed from a second language model, and a third network layer, and extracting, via the trained first neural network model, the phenotype keywords from the clinical description information comprises: segmenting each sentence in the clinical description information into words or punctuation, so as to convert the segmented words or punctuation into corresponding input identifiers; converting each corresponding input identifier into a multidimensional first feature vector based on the first network layer; generating a second feature vector based on the second network layer and a predetermined keyword set, the second feature vector indicating whether phrases formed by each word and its surrounding words belong to the predetermined keyword set; and determining, via the third network layer, the phenotype keywords in the clinical description information based on the first feature vector and the second feature vector.
In some embodiments, the first language model is a BERT model, the second language model is an N-gram model, and the third network layer is constructed based on a conditional random field model.
In some embodiments, determining the phenotypic keywords in the clinical profile comprises: fusing the first feature vector and the second feature vector to generate a fused feature vector; extracting, via a third network layer, features of the fused feature vector to predict a category for each character; and determining a phenotype keyword in the clinical description information based on the predicted category of each character.
In some embodiments, extracting features of the fused feature vector via the third network layer to predict a category for each character comprises: reducing the dimension of the fused feature vector through a fully-connected network layer, so that the dimension of the feature vector after dimension reduction is consistent with the category number of the labels; inputting the feature vectors subjected to the dimension reduction into a third network layer so as to calculate the log likelihood value of each feature vector subjected to the dimension reduction; taking a negative average of the calculated log-likelihood values as a loss value of the first neural network model; and decoding using a viterbi algorithm to predict a class for each character.
In some embodiments, ranking the candidate genes based on the extracted phenotypic keywords to generate ranking information about the candidate genes comprises: identifying the candidate gene and the corresponding syndrome; extracting a phenotype for each corresponding syndrome; calculating, via a second neural network model, similarities of phenotype keywords in the clinical description information to phenotypes of the corresponding syndromes; and ranking the candidate genes based on the calculated similarity so as to generate ranking information on the candidate genes.
In some embodiments, calculating, via the second neural network model, a similarity of the phenotype keyword in the clinical descriptive information to a phenotype of the corresponding syndrome comprises: respectively preprocessing the phenotype keywords in the clinical description information and the phenotype of the corresponding syndrome so as to generate a first phenotype input identifier and a second phenotype input identifier; encoding the first phenotype input identifier and the second phenotype input identifier as a third feature vector and a fourth feature vector for input into a second neural network model; averaging the third feature vectors corresponding to all characters in the phenotype keywords in the clinical description information so as to obtain a first corresponding code of the phenotype keywords in the clinical description information; averaging the fourth feature vectors corresponding to all the characters in the phenotype corresponding to the syndrome so as to obtain a second corresponding code corresponding to the phenotype of the syndrome; and calculating the cosine of the included angle of the first corresponding code and the second corresponding code so as to determine the phenotype similarity of the phenotype keywords in the clinical description information and the corresponding syndrome.
In some embodiments, the training method of the second neural network model comprises: mapping a standard term set of a predetermined database into a multidimensional space according to the similarity between the standard terms; randomly extracting two standard terms so as to calculate the similarity between them, using the calculated similarity as a training target value; and performing supervised training of the second neural network model on associated spoken-language expressions and standard terms of the predetermined database, to generate the trained second neural network model.
In some embodiments, supervised training of the second neural network model on associated spoken-language expressions and standard terms of the predetermined database includes: randomly generating a first random number and a second random number between 0 and 1; determining whether the first random number is less than a first predetermined threshold; in response to determining that the first random number is less than the first predetermined threshold, taking the spoken keyword as a first input and a standard term as a second input to the second neural network model and setting the training target value of the second neural network model to 1, the standard term being obtained from a predetermined data set; in response to determining that the first random number is greater than or equal to the first predetermined threshold, taking the spoken keyword as the first input and a standard term as the second input, and setting the training target value of the second neural network model to the tree similarity, in a predetermined data set tree, between the standard term labeled for the spoken keyword and the standard term taken as the second input, the predetermined data set tree including a plurality of nodes, each node corresponding to one standard term; determining whether the second random number is less than a second predetermined threshold; in response to determining that the second random number is less than the second predetermined threshold, training the second neural network model; and in response to determining that the second random number is greater than or equal to the second predetermined threshold, making the spoken keyword the second input and the standard term the first input of the second neural network model for training. A sketch of this pair construction follows.
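The following Python sketch illustrates this training-pair construction; the two threshold values and the helper names (tree_similarity, all_terms) are assumptions, since the patent leaves them unspecified.

    import random

    # Assumed values for the two predetermined thresholds.
    FIRST_THRESHOLD = 0.5
    SECOND_THRESHOLD = 0.5

    def make_training_pair(spoken_keyword, labeled_term, all_terms, tree_similarity):
        """Build one (first_input, second_input, target) training triple.

        tree_similarity(a, b) stands for the similarity between two nodes of
        the predetermined data set tree (e.g., the CHPO tree), where each
        node corresponds to one standard term.
        """
        r1, r2 = random.random(), random.random()

        if r1 < FIRST_THRESHOLD:
            # Matched pair: spoken keyword vs. its labeled standard term, target 1.
            first, second, target = spoken_keyword, labeled_term, 1.0
        else:
            # Random pair: the target is the tree similarity between the
            # labeled standard term and the randomly drawn standard term.
            drawn = random.choice(all_terms)
            first, second = spoken_keyword, drawn
            target = tree_similarity(labeled_term, drawn)

        if r2 >= SECOND_THRESHOLD:
            # Swap which branch of the twin network receives which input.
            first, second = second, first
        return first, second, target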
In some embodiments, ranking the candidate genes based on the calculated similarities comprises: determining a similarity matrix for each corresponding syndrome based on the plurality of phenotypes of that syndrome and the plurality of phenotype keywords extracted from the clinical description information, the similarity matrix indicating the similarities between the phenotypes of the corresponding syndrome and the phenotype keywords in the clinical description information; determining an evaluation value for each corresponding syndrome based on its similarity matrix; and determining evaluation values of the candidate genes based on the evaluation values of the corresponding syndromes, so as to rank the candidate genes based on the determined evaluation values. A toy sketch follows.
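A toy numpy sketch of scoring one syndrome from its similarity matrix. The aggregation rule (averaging each keyword's best match, then taking a gene's best syndrome score) is an assumption, since the patent defers the exact evaluation formula to FIG. 10.

    import numpy as np

    # Rows: phenotype keywords extracted from the clinical description.
    # Columns: phenotypes of one corresponding syndrome.
    similarity_matrix = np.array([
        [0.91, 0.30],   # keyword 1 vs. the syndrome's two phenotypes
        [0.12, 0.75],   # keyword 2
    ])

    # Assumed aggregation: each keyword contributes its best-matching
    # phenotype, and the syndrome score is the mean of those maxima.
    syndrome_score = similarity_matrix.max(axis=1).mean()

    # A candidate gene linked to several syndromes could then take, e.g.,
    # the best of its syndrome scores; genes are ranked by that value.
    gene_score = max([syndrome_score])   # one syndrome in this toy example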
In some embodiments, the second neural network model is constructed based on a twin neural network formed by the BERT model.
In some embodiments, the annotation information for the variant site comprises at least: gene function annotation information, gene-related disease information, and crowd frequency information.
In some embodiments, filtering the variant sites based on annotation information about the variant sites comprises:
filtering out benign and likely benign variant sites based on the annotation information about the variant sites; filtering out a variant site in response to determining that its population frequency information falls within a predetermined frequency threshold range and that it is not annotated with relevant disease information of a first predetermined data set; filtering out a variant site in response to determining that its gene-related disease information is not annotated with relevant disease information of a second predetermined data set; and filtering out a variant site in response to determining that the gene indicated by its gene function annotation information does not belong to the predetermined gene range of a third predetermined data set.
In some embodiments, generating the input data based on the annotation information of the variant sites remaining after filtering and the ranking information about the candidate genes comprises: fusing the annotation information of the variant sites remaining after filtering and the ranking information of the candidate genes; converting the fused variant site annotation information and candidate gene ranking information into feature vectors; and performing normalization processing on the converted feature vectors to generate the input data.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.
Drawings
Fig. 1 shows a schematic diagram of a system for a method of determining a risk of disease variant site according to an embodiment of the present disclosure.
Fig. 2 shows a flow diagram of a method for determining a risk of disease variant site, according to an embodiment of the disclosure.
Fig. 3 shows a flow diagram of a method for extracting phenotypic keywords in clinical descriptive information, according to an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of a first neural network model according to an embodiment of the present disclosure.
FIG. 5 schematically shows a second neural network model in accordance with an embodiment of the present disclosure.
FIG. 6 shows a flow diagram of a method for calculating similarity between two phenotypes, according to an embodiment of the present disclosure.
FIG. 7 shows a flow diagram of a training method of a second neural network model, in accordance with an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of a CHPO tree structure according to an embodiment of the disclosure.
Fig. 9 shows a flow diagram of a method for supervised training for a second neural network model in accordance with an embodiment of the present disclosure.
Fig. 10 shows a flow diagram of a method for ranking candidate genes according to an embodiment of the disclosure.
FIG. 11 schematically shows a block diagram of an electronic device suitable for implementing an embodiment of the disclosure.
Like or corresponding reference characters designate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.
As mentioned previously, many key steps in conventional approaches to determining disease-risk variant sites (e.g., manually interpreting phenotype keywords in electronic medical records for matching against the HPO) require manual intervention, and automated pipeline solutions for determining disease-risk variant sites are therefore lacking; moreover, conventional schemes for determining disease-risk variant sites are time-consuming and prone to error.
To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for determining disease-risk variant sites. The scheme determines variant sites based on the obtained whole-exome sequencing sequence alignment result information of a sample of a subject to be tested, annotates the variant sites, and extracts phenotype keywords from the obtained clinical description information about the subject via a first neural network model; it can thereby extract the phenotype keywords more accurately and quickly, avoiding the inefficiency and error-proneness of manually reading phenotype keywords from electronic medical records. In addition, the candidate genes are ranked based on the extracted phenotype keywords, which improves candidate gene ranking performance. Furthermore, the variant sites are filtered based on their annotation information, and features of the input data generated from the annotation information of the filtered variant sites and the ranking information of the candidate genes are extracted with a trained predetermined model, so as to determine the disease-risk variant sites of the subject. Because data features are extracted via several techniques, such as site filtering, site annotation information, and gene ranking results, and a vector representation is generated and input into the predetermined model for determining the disease-risk variant sites of the subject, the determined disease-risk variant sites can be more accurate. Thus, the present disclosure enables automated, rapid, and accurate determination of disease-risk variant sites.
Fig. 1 shows a schematic diagram of a system 100 for a method of determining disease-risk variant sites according to an embodiment of the present disclosure. As shown in FIG. 1, the system 100 includes, for example, a computing device 110, a bioinformatics server 150, and a network 140. The computing device 110 may interact with the bioinformatics server 150 in a wired or wireless manner via the network 140.
The computing device 110 may be used, for example, to obtain whole-exome sequencing sequence alignment result information for a sample of the subject to be tested and clinical description information about the subject; determine variant sites based on the sequencing sequence alignment result information; and annotate the variant sites to generate annotation information about them. The computing device 110 may also be used to extract phenotype keywords from the clinical description information and to rank candidate genes based on the extracted phenotype keywords. In addition, the computing device 110 may be configured to filter the variant sites based on their annotation information and, based on a trained predetermined model, extract features of input data generated from the annotation information of the filtered variant sites and the ranking information of the candidate genes, so as to determine disease-risk variant sites for the subject. In some embodiments, the computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. In addition, one or more virtual machines may be running on each computing device. The computing device 110 includes, for example, an alignment result information and clinical description information acquisition unit 112, a variant site determination and annotation unit 114, a phenotype keyword extraction unit 116, a candidate gene ranking unit 118, a variant site filtering unit 120, an input data generation unit 122, and a disease-risk variant site determination unit 124. These units may be configured on one or more computing devices 110.
The alignment result information and clinical description information acquisition unit 112 is used to obtain alignment result information of the whole-exome sequencing sequence of a sample of the subject to be tested and clinical description information about the subject.
The variant site determination and annotation unit 114 is used to determine variant sites based on the whole-exome sequencing sequence alignment result information, so as to annotate the variant sites to generate annotation information about them.
The phenotype keyword extraction unit 116 is used to extract the phenotype keywords in the clinical description information via the trained first neural network model.
The candidate gene ranking unit 118 is used to rank the candidate genes based on the extracted phenotype keywords so as to generate ranking information about the candidate genes, the candidate genes being associated with the clinical description information.
The variant site filtering unit 120 is used to filter the variant sites based on the annotation information about the variant sites.
The input data generation unit 122 is used to generate input data based on the annotation information of the variant sites remaining after filtering and the ranking information about the candidate genes.
The disease-risk variant site determination unit 124 is used to extract features of the input data based on the trained predetermined model so as to determine disease-risk variant sites for the subject to be tested.
A method for determining disease-risk variant sites according to an embodiment of the present disclosure will now be described with reference to FIG. 2. FIG. 2 shows a flow diagram of a method 200 for determining disease-risk variant sites according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 1100 depicted in FIG. 11, or at the computing device 110 depicted in FIG. 1. It should also be understood that method 200 may include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 202, the computing device 110 obtains whole-exome sequencing sequence alignment result information for a sample of the subject to be tested and clinical description information about the subject.
The sample of the subject is, for example, a blood sample. The clinical description information about the subject is, for example, the subject's electronic medical record. The subject to be tested is, for example, the proband.
Ways to obtain the alignment result information of the whole-exome sequencing sequence of the sample include, for example: the computing device 110 obtains the sequencing sequence alignment result information from the bioinformatics server 150; or the computing device 110 first obtains a FASTQ file of whole-exome sequencing data via NGS sequencing, and then invokes data alignment software (e.g., BWA) to perform sequencing sequence alignment (e.g., aligning the whole-exome sequencing data of the blood sample of the subject against the human reference genome) to generate the sequencing sequence alignment result information. A minimal sketch of this step follows.
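The following minimal Python sketch illustrates the alignment step, assuming BWA and samtools are installed; the FASTQ file names and reference genome path are placeholders, not values from the patent.

    import subprocess

    reference = "hg19.fasta"          # hypothetical human reference genome path
    fastq_1 = "sample_R1.fastq.gz"    # hypothetical paired-end whole-exome reads
    fastq_2 = "sample_R2.fastq.gz"

    # Align the reads against the human reference with `bwa mem`,
    # then sort and index the alignments as a coordinate-sorted BAM file.
    with open("sample.sam", "w") as sam:
        subprocess.run(["bwa", "mem", reference, fastq_1, fastq_2],
                       stdout=sam, check=True)
    subprocess.run(["samtools", "sort", "-o", "sample.sorted.bam", "sample.sam"],
                   check=True)
    subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)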
At step 204, the computing device 110 determines variant sites based on the whole-exome sequencing sequence alignment result information, so as to annotate the variant sites to generate annotation information about them.
The method for determining the variant sites includes, for example: after obtaining the sequencing sequence alignment result information, the computing device 110 invokes genetic variation detection software (e.g., without limitation, the HaplotypeCaller module of GATK) to detect variant sites in the obtained sequencing sequence alignment result information, detecting SNP and Indel variant sites and thereby generating a VCF file.
The annotation information about the variant sites includes, for example, at least: gene function annotation information, gene-related disease information, and population frequency information. In some embodiments, the annotation information for a variant site further comprises pathogenicity prediction software annotations. There are various ways to annotate the variant sites. For example, the computing device 110 invokes ANNOVAR to annotate the variant sites, obtaining accurate annotation information for each SNP and Indel variant site. ANNOVAR is a command-line tool written in Perl and can be executed on any operating system with a Perl interpreter installed. A sketch of the calling and annotation steps follows.
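A Python sketch of the variant calling and annotation steps, assuming GATK4 and ANNOVAR are installed; the annotation database names passed to -protocol are illustrative assumptions, not the patent's configuration.

    import subprocess

    # Call SNP and Indel variant sites from the sorted alignments,
    # producing a VCF file.
    subprocess.run([
        "gatk", "HaplotypeCaller",
        "-R", "hg19.fasta",            # same reference used for alignment
        "-I", "sample.sorted.bam",
        "-O", "sample.vcf.gz",
    ], check=True)

    # Annotate each variant site with ANNOVAR (gene function, disease
    # databases, population frequency); the protocol list is an assumption.
    subprocess.run([
        "perl", "table_annovar.pl", "sample.vcf.gz", "humandb/",
        "-buildver", "hg19",
        "-out", "sample.anno",
        "-remove",
        "-protocol", "refGene,clinvar_20210501,gnomad_genome",
        "-operation", "g,f,f",
        "-vcfinput",
    ], check=True)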
At step 206, the computing device 110 extracts, via the trained first neural network model, the phenotype keywords in the clinical description information. For example, the computing device 110 identifies the clinical phenotype keywords of the proband from the clinical phenotype description in the electronic medical record, the first neural network model having been trained on multiple samples.
The multiple samples used to train the first neural network model are, for example, a training data set formed by manually labeling a plurality (e.g., without limitation, 6000) of clinical cases. The labeling method is, for example, the BIEOS scheme for character-by-character annotation. For example, for the clinical phenotype description "主诉：矮小" ("chief complaint: short stature") in an electronic medical record, the corresponding manually assigned labels are: O, O, O, B, E, as illustrated below.
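The following toy snippet illustrates the character-level BIEOS labeling on the example above; the tag meanings (B = begin, I = inside, E = end, O = outside, S = single character) are the standard reading of the scheme.

    # Character-level BIEOS labels: in "主诉：矮小" ("chief complaint: short
    # stature"), only the phenotype keyword 矮小 ("short stature") is labeled.
    chars  = ["主", "诉", "：", "矮", "小"]
    labels = ["O",  "O",  "O",  "B",  "E"]   # B = keyword begins, E = keyword ends

    # A single-character keyword would instead receive the tag "S";
    # characters inside a longer keyword would receive "I".
    for ch, tag in zip(chars, labels):
        print(f"{ch}\t{tag}")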
The first neural network model includes, for example, a first network layer constructed from a first language model, a second network layer constructed from a second language model, and a third network layer. In some embodiments, the first neural network model takes the pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) as its backbone, into which a fusion of multiple algorithms is introduced. For example, the first neural network model is constructed by introducing a Conditional Random Field (CRF) model, dictionary embedding, and an N-gram model on top of BERT. The architecture of the first neural network model is described below in conjunction with FIG. 4. FIG. 4 shows a schematic diagram of a first neural network model 400 according to an embodiment of the present disclosure. As shown in FIG. 4, the first neural network model 400 includes, for example, a first network layer 404 constructed from a first language model (for example and without limitation, a BERT model), a second network layer 406 constructed from a second language model (for example and without limitation, an N-gram model), and a third network layer 410 (for example and without limitation, constructed based on a CRF). The first network layer 404 is used, for example, to convert the input into first feature vectors (e.g., T1-T5 shown in FIG. 4). The second network layer is used to generate second feature vectors (e.g., G1-G5 shown in FIG. 4) based on the input and the predetermined keyword set. In some embodiments, the first neural network model 400 further includes a network layer 408 configured to fuse the first feature vector and the second feature vector to generate a fused feature vector.
The method for extracting the phenotype keywords in the clinical description information comprises: segmenting each sentence in the clinical description information into words or punctuation, so as to convert the segmented words or punctuation into corresponding input identifiers; converting each corresponding input identifier into a multidimensional first feature vector based on the first network layer; generating a second feature vector based on the second network layer and the predetermined keyword set, the second feature vector indicating whether phrases formed by each word and its surrounding words belong to the predetermined keyword set; and determining, via the third network layer, the phenotype keywords in the clinical description information based on the first feature vector and the second feature vector. The method 300 for extracting phenotype keywords is described in detail below with reference to FIG. 3 and is not repeated here.
Table 1 below schematically shows part of the network parameters of the first neural network model. The remaining network parameters not shown are, for example, default values.
[Table 1: partial network parameters of the first neural network model; not recoverable from the source image.]
At step 208, the computing device 110 ranks the candidate genes based on the extracted phenotypic keywords to generate ranking information about the candidate genes, the candidate genes being associated with clinical descriptive information.
Regarding a method of generating ranking information on candidate genes, it includes, for example: identifying the candidate gene and the corresponding syndrome; extracting a phenotype for each corresponding syndrome; calculating, via a second neural network model, similarities of phenotype keywords in the clinical description information to phenotypes of the corresponding syndromes; and ranking the candidate genes based on the calculated similarity so as to generate ranking information on the candidate genes.
Specifically, the computing device 110 first identifies the candidate genes and the corresponding syndromes. The method for identifying candidate genes includes, for example: determining genes that have a clearly corresponding syndrome in a fourth predetermined data set (e.g., the OMIM database) as candidate genes. The method for identifying the corresponding syndromes includes, for example: determining syndromes that have a clearly corresponding gene in the fourth predetermined data set (e.g., the OMIM database) as corresponding syndromes. The candidate genes and corresponding syndromes are, for example, in a many-to-many relationship.
Second, the computing device 110 extracts the phenotypes of each corresponding syndrome. For example, the computing device 110 extracts the phenotypes corresponding to each syndrome according to the phenotype annotation file provided by the HPO official website.
Thereafter, the computing device 110 calculates, via the second neural network model, the similarities of the phenotype keywords in the clinical description information to the phenotypes of the corresponding syndromes. For example, for two phenotypes fed separately to the first input and the second input of the second neural network model, a similarity calculation is performed via the model. The method for calculating the similarity between a phenotype keyword in the clinical description information and a phenotype of a corresponding syndrome is described in detail below with reference to FIG. 6 and is not repeated here.
Further, the computing device 110 ranks the candidate genes based on the calculated similarities so as to generate ranking information about the candidate genes. It should be understood that a clinical case or clinical description often contains multiple phenotype keywords, and a gene or syndrome often leads to multiple different phenotypes. Therefore, a comprehensive calculation method is required to determine the final rank of each candidate gene. The method for determining the final rank of a single candidate gene is described in detail below with reference to FIG. 10 and is not repeated here.
The second neural network model is used to calculate the similarity between two phenotypes. It is, for example, a twin neural network (Siamese network) constructed based on the BERT model. FIG. 5 schematically shows a second neural network model 500 according to an embodiment of the present disclosure. The second neural network model 500 includes, for example and without limitation: a first BERT network layer 504, a first pooling layer 506, a second BERT network layer 514, a second pooling layer 516, and a similarity calculation network layer 520.
The first pooling layer 506 and the second pooling layer 516 are each constructed, for example, by mean-pooling. Their calculation is described by equation (1) below.
$$x = \frac{1}{n}\sum_{i=1}^{n} V_i \qquad (1)$$
In equation (1), $V_i$ denotes the output vector of the BERT network layer at the $i$-th position, $n$ the number of positions, and $x$ the output of the first pooling layer 506 or the second pooling layer 516.
As shown in FIG. 5, a first input 502 and a second input 512 are fed into a first BERT network layer 504 and a second BERT network layer 514, respectively. The output of the first BERT network layer 504, via the first pooling layer 506, forms the corresponding encoding 508 of a phenotype keyword in the clinical description information. The output of the second BERT network layer 514, via the second pooling layer 516, forms the corresponding encoding 518 of a phenotype of the corresponding syndrome. The similarity calculation network layer 520 then computes the cosine of the angle between these two vectors to determine the similarity between the two phenotypes. The algorithm of the similarity calculation network layer 520 is, for example, cosine_sim(u, v), where cosine_sim() denotes the cosine of the angle between the vectors u and v, u denotes the corresponding encoding of the phenotype keyword in the clinical description information, and v the corresponding encoding of the phenotype of the corresponding syndrome.
In some embodiments, the similarity calculation network layer 520 is calculated, for example, as shown in the following equation (2).
$$\mathrm{cosine\_sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert\, \lVert y \rVert} \qquad (2)$$
In equation (2), $x$ denotes the feature vector output by the first pooling layer and $y$ the feature vector output by the second pooling layer; cosine_sim(x, y), the cosine of the angle between them (also called the cosine distance), is the computed similarity. The training samples of the second neural network model 500 are, for example, a plurality of labeled corpus samples. A corpus sample is, for example, a phenotype keyword manually extracted from clinical description information. For each phenotype keyword, the closest phenotype or phenotypes are selected from the standard terms of a predetermined data set (e.g., without limitation, the CHPO database) as the CHPO standard term(s) corresponding to the manually extracted phenotype keyword. Table 2 below schematically shows the correspondence between manually extracted phenotype keywords in clinical description information and CHPO standard terms.
[Table 2: correspondence between manually extracted phenotype keywords and CHPO standard terms; not recoverable from the source image.]
Regarding the training of the second neural network model 500, a two-step fine-tuning method can be adopted to obtain the final trained model. The first step is unsupervised training, which mainly realizes similarity calculation between standard terms; the second step is supervised training, which realizes similarity calculation from spoken-language descriptions to standard descriptions. The training method 700 of the second neural network model 500 is described in detail below with reference to FIGS. 7 and 8 and is not repeated here.
The loss function of the second neural network model 500 is, for example, as shown in the following equation (3).
$$\mathrm{loss} = \left(\mathrm{object\_sim} - \mathrm{cosine\_sim}\right)^2 \qquad (3)$$
In equation (3), loss denotes the loss function of the second neural network model 500, object_sim the training target value of the second neural network model 500, and cosine_sim the similarity of the first input and the second input calculated via the second neural network model 500 constructed on the twin neural network structure. A minimal sketch of the twin-network encoding, similarity, and loss follows.
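A minimal sketch of the twin-network encoding, cosine similarity, and loss, assuming the Hugging Face transformers package; the "bert-base-chinese" checkpoint is an assumption, and the two branches share weights here, which is the usual Siamese construction.

    import torch
    import torch.nn.functional as F
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    def encode(phrase: str) -> torch.Tensor:
        """Mean-pool the BERT output vectors over all characters (equation (1))."""
        inputs = tokenizer(phrase, return_tensors="pt")
        with torch.no_grad():
            outputs = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
        return outputs.mean(dim=1).squeeze(0)            # 768-dim encoding

    u = encode("矮小")        # phenotype keyword from the clinical description
    v = encode("身材矮小")    # phenotype of a candidate syndrome (CHPO term)
    cosine_sim = F.cosine_similarity(u, v, dim=0)        # equation (2)

    # Training would minimize the squared gap to the target similarity
    # (equation (3)); object_sim = 1.0 is the matched-pair case.
    object_sim = torch.tensor(1.0)
    loss = (object_sim - cosine_sim) ** 2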
Table 3 below schematically shows part of the network parameters of the second neural network model. The remaining network parameters not shown are, for example, default values.
[Table 3: partial network parameters of the second neural network model; not recoverable from the source image.]
At step 210, the computing device 110 filters the variant loci based on the annotation information for the variant loci.
The variant sites are filtered, for example, in the following ways; a consolidated sketch follows the list.
Based on the annotation information about the variant sites, benign and likely benign variant sites are filtered out. For example, the computing device 110 filters out sites classified as Benign or Likely benign based on the scores in the site annotation information.
A variant site is filtered out in response to determining that its population frequency information falls within a predetermined frequency threshold range and that it is not annotated with relevant disease information of the first predetermined data set. For example, the computing device 110 filters according to the population frequency in the site annotation information: if the frequency is greater than 0.05, or is greater than 0.01 and less than 0.05 while no pathogenicity information is annotated in the HGMD or ClinVar databases, the site is filtered out.
A variant site is filtered out in response to determining that its gene-related disease information is not annotated with the relevant disease information of the second predetermined data set. For example, the computing device 110 filters according to the disease-related content in the site annotation information: if no disease-related information from the HGMD, OMIM, or ClinVar databases is annotated, the site is filtered out.
A variant site is filtered out in response to determining that the gene indicated by its gene function annotation information does not belong to the predetermined gene range of the third predetermined data set. For example, the computing device 110 filters according to the gene name in the site annotation information: taking the HGNC genes with a phenotype in OMIM as the gene range, if the gene is not in this range, the site is filtered out.
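A consolidated Python sketch of the four filtering rules; the per-site dictionary keys are illustrative assumptions, not ANNOVAR's actual column names.

    def keep_site(site: dict, omim_phenotype_genes: set) -> bool:
        # Rule 1: drop Benign / Likely benign classifications.
        if site["classification"] in ("Benign", "Likely benign"):
            return False
        # Rule 2: drop common variants; a site with frequency in (0.01, 0.05)
        # survives only if HGMD or ClinVar annotates pathogenicity.
        freq = site["population_frequency"]
        if freq > 0.05:
            return False
        if 0.01 < freq <= 0.05 and not (site["hgmd"] or site["clinvar"]):
            return False
        # Rule 3: drop sites with no disease annotation in HGMD, OMIM or ClinVar.
        if not (site["hgmd"] or site["omim"] or site["clinvar"]):
            return False
        # Rule 4: drop sites in genes without an OMIM phenotype entry.
        if site["gene"] not in omim_phenotype_genes:
            return False
        return True

    # One toy annotated site; key names are illustrative.
    annotated_sites = [{
        "classification": "Uncertain significance",
        "population_frequency": 0.0004,
        "hgmd": True, "clinvar": False, "omim": True,
        "gene": "BRCA1",
    }]
    omim_phenotype_genes = {"BRCA1"}

    filtered_sites = [s for s in annotated_sites
                      if keep_site(s, omim_phenotype_genes)]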
At step 212, the computing device 110 generates input data based on the annotation information of the variant sites remaining after filtering and the ranking information about the candidate genes. For example, the computing device 110 obtains the variant sites remaining after filtering in the several ways described above, together with their annotation information, and generates input data based on that annotation information and the ranking information about the candidate genes.
The method of generating the input data includes, for example: the computing device 110 fuses the annotation information of the variant sites remaining after filtering and the ranking information of the candidate genes; converts the fused variant site annotation information and candidate gene ranking information into feature vectors; and performs normalization processing on the converted feature vectors to generate the input data. For example, the computing device 110 fuses and converts the filtered variant site annotation information and the candidate gene ranking information into feature vectors using vector encoding methods such as ordinal encoding (OrdinalEncoder), one-hot encoding (OneHotEncoder), and continuous-variable binning (KBinsDiscretizer), combined with techniques such as feature statistics and feature fusion. Normalization is then performed on the converted feature vectors in the manner of equation (4) below, and the input data is generated based on the normalized feature vectors.
$$x_i' = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)} \qquad (4)$$
In equation (4), $x_i'$ denotes the $i$-th feature vector after normalization, and $\max(x_i)$ and $\min(x_i)$ denote the maximum and minimum values of the $i$-th feature vector, respectively. A small sketch of the encoding and normalization follows.
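A small scikit-learn sketch of the feature encoding and min-max normalization; the toy feature columns are assumptions, not the patent's actual feature set.

    import numpy as np
    from sklearn.preprocessing import (KBinsDiscretizer, MinMaxScaler,
                                       OneHotEncoder, OrdinalEncoder)

    # Toy per-site columns: variant consequence, zygosity, population
    # frequency, and the candidate-gene ranking information.
    consequence = np.array([["missense"], ["frameshift"], ["missense"]])
    zygosity    = np.array([["het"], ["hom"], ["het"]])
    frequency   = np.array([[0.002], [0.0004], [0.03]])
    gene_rank   = np.array([[1.0], [12.0], [5.0]])

    onehot  = OneHotEncoder(sparse_output=False).fit_transform(consequence)
    ordinal = OrdinalEncoder().fit_transform(zygosity)
    binned  = KBinsDiscretizer(n_bins=3, encode="onehot-dense",
                               strategy="quantile").fit_transform(frequency)

    # Fuse the pieces into one feature matrix, then min-max normalize each
    # column as in equation (4).
    features = np.hstack([onehot, ordinal, binned, gene_rank])
    input_data = MinMaxScaler().fit_transform(features)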
At step 214, the computing device 110 extracts features of the input data based on the trained predetermined model, in order to determine disease-risk variant sites for the subject to be tested.
The predetermined model is constructed based on, for example and without limitation, a Random Forest (RF) model. The predetermined model performs feature selection, for example, using bootstrap sampling with replacement and Gini coefficients.
The processing of the predetermined model constructed on the random forest model is as follows: randomly draw n samples from the sample set with replacement; randomly select k features from all features and build a decision tree for the drawn samples using those features; repeat the two preceding steps m times to generate m decision trees, forming the random forest; for new data, each tree makes a decision, and the final category is determined by majority vote. The selection criterion for the Gini coefficient is that each child node achieves the highest purity: when all observations falling in a child node belong to the same category, the Gini coefficient is smallest, the purity highest, the uncertainty smallest, and the data are split more thoroughly and cleanly.
An example method for determining the disease-risk variant sites of the subject is as follows: the computing device 110 first calculates evaluation values for the variant sites via the trained predetermined model and ranks the sites from high to low by their calculated evaluation values; it then determines, in ranked order, whether the evaluation value of each variant site is greater than or equal to a predetermined site threshold, and if so, determines that variant site to be a disease-risk variant site for the subject. If the evaluation values of all variant sites are determined to be less than the predetermined site threshold, the sample is determined to be negative. A sketch follows.
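A scikit-learn sketch of training the predetermined model and thresholding the ranked site scores; the training data and the 0.5 threshold value are assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X_train = np.random.rand(200, 10)          # placeholder feature vectors
    y_train = np.random.randint(0, 2, 200)     # 1 = disease-risk site, 0 = not

    model = RandomForestClassifier(
        n_estimators=100,      # m decision trees
        max_features="sqrt",   # k features drawn per split
        criterion="gini",      # Gini-coefficient splitting
        bootstrap=True,        # sampling with replacement
    ).fit(X_train, y_train)

    X_sites = np.random.rand(5, 10)            # input data from step 212
    scores = model.predict_proba(X_sites)[:, 1]

    SITE_THRESHOLD = 0.5                       # predetermined threshold (assumed)
    order = np.argsort(scores)[::-1]           # rank sites from high to low
    risk_sites = [i for i in order if scores[i] >= SITE_THRESHOLD]
    if not risk_sites:
        print("sample is negative")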
Table 4 below schematically shows prediction results for the disease-risk variant sites of test subjects.
[Table 4: prediction results for disease-risk variant sites of the test subjects; not recoverable from the source image.]
In the above scheme, variant sites are determined from the obtained whole-exome sequencing sequence alignment result information of a sample of the subject to be tested and annotated, and phenotype keywords are extracted from the obtained clinical description information about the subject via the first neural network model; phenotype keywords can thus be extracted more accurately and quickly, avoiding the inefficiency and error-proneness of manually reading phenotype keywords from electronic medical records. In addition, the candidate genes are ranked based on the extracted phenotype keywords, which improves candidate gene ranking performance. Furthermore, the variant sites are filtered based on their annotation information, and features of the input data generated from the annotation information of the filtered variant sites and the ranking information of the candidate genes are extracted with the trained predetermined model, thereby determining the disease-risk variant sites of the subject. Because data features are extracted via several techniques, such as site filtering, site annotation information, and gene ranking results, and a vector representation is generated and input into the predetermined model, the determined disease-risk variant sites can be more accurate. Thus, the present disclosure enables automated, rapid, and accurate determination of disease-risk variant sites.
FIG. 3 shows a flow diagram of a method 300 for extracting phenotype keywords from clinical description information, according to an embodiment of the present disclosure. It is to be appreciated that the method 300 may be performed, for example, at the electronic device 1100 depicted in FIG. 11, or at the computing device 110 depicted in FIG. 1. It should be understood that method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 302, the computing device 110 segments each sentence in the clinical description information into words or punctuation, so as to convert the segmented words or punctuation into corresponding input identifiers. For example, the computing device 110 segments each sentence of the clinical phenotype description in the proband's electronic medical record by character or punctuation, and then converts the segmented characters or punctuation into corresponding identifiers (e.g., IDs). As shown in FIG. 4, the whole segment of clinical description information 402 is "主诉：矮小" ("chief complaint: short stature"). The clinical description information 402 is segmented into characters and punctuation, i.e., into: "主", "诉", "：", "矮", "小".
There are various methods for converting the segmented words or punctuation into corresponding identifiers. For example, the computing device 110 may extract the input identifiers corresponding to the segmented words or punctuation according to the pre-created character set of the BERT model. It should be appreciated that, compared with the identifiers generated by the BERT model, which are associated with the words or punctuation surrounding each segmented word or punctuation, the identifiers produced by a Word2Vec model are generally independent of the context in which the segmented word or punctuation appears. The BERT-based method can therefore capture context-dependent information such as word-sense ambiguity, so that the converted identifiers represent the input more accurately and improve model performance. The following steps of the method 300 are exemplified with the first network layer of the first neural network model constructed based on the BERT model; a tokenization sketch follows.
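A sketch of step 302 using the Hugging Face tokenizer; "bert-base-chinese" is an assumed checkpoint, not one named in the patent.

    from transformers import BertTokenizer

    # Map each character of the clinical description to a BERT input ID
    # from the model's pre-created character set.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

    sentence = "主诉：矮小。"               # "chief complaint: short stature."
    tokens = list(sentence)                 # character-level segmentation
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(list(zip(tokens, input_ids)))     # each character -> its input ID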
At step 304, the computing device 110 converts each corresponding input identifier into a multidimensional first feature vector based on the first network layer, the first language model being a BERT model. For example, for each corresponding input identifier, the computing device 110 encodes the character using the BERT model to obtain a 768-dimensional first feature vector.
At step 306, the computing device 110 generates a second feature vector based on the second network layer and the predetermined keyword set, the second feature vector indicating whether phrases formed by each word and its surrounding words belong to the predetermined keyword set.
The predetermined keyword set is, for example, a dictionary embedding, which includes: Chinese disease names generated by translating the English disease names in OMIM, common disease names and their abbreviations, and gene names from the refGene database.
The second language model is, for example, an N-gram model.
The method of generating the second feature vector includes, for example: setting the N-gram length to 1 through 6, so that each word and its surrounding words form phrases of the predetermined lengths, and generating a 21-dimensional one-hot (binary) vector according to whether each phrase appears in the predetermined keyword set, where '1' indicates that the formed phrase appears in the predetermined keyword set and '0' indicates that it does not, as sketched below.
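One plausible reading of the 21 dimensions (an inference, not stated in the patent): a window of length n can cover a given character position in n ways, and 1 + 2 + ... + 6 = 21. The sketch below implements that reading; the keyword set is a toy stand-in for the dictionary.

    keyword_set = {"矮小", "肝癌"}          # toy stand-in for the dictionary

    def dictionary_features(chars: list, position: int) -> list:
        """21-dim binary vector: one flag per n-gram window covering position."""
        feats = []
        for n in range(1, 7):               # n-gram lengths 1..6
            for offset in range(n):         # window start relative to position
                start = position - offset
                if 0 <= start and start + n <= len(chars):
                    phrase = "".join(chars[start:start + n])
                else:
                    phrase = ""             # window falls off the sentence
                feats.append(1 if phrase in keyword_set else 0)
        return feats                        # 1+2+...+6 = 21 features

    chars = list("主诉：矮小")
    print(dictionary_features(chars, 3))    # features for the character 矮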
At step 308, the computing device 110 determines, via the third network layer, a phenotype keyword in the clinical descriptive information based on the first feature vector and the second feature vector. The third network layer is constructed based on, for example and without limitation, Conditional Random Fields (CRFs).
In some embodiments, the third network layer is constructed, for example, using a linear-chain conditional random field model. The manner in which the third network layer calculates the conditional probability is described below in conjunction with equation (5).
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l\, s_l(y_i, x, i)\Big) \qquad (5)$$
In equation (5), $t$ and $s$ denote feature functions, where $t$ denotes a transition feature and $s$ a state feature; $x$ is the observed variable and $y$ the hidden variable; $\lambda_k$ and $\mu_l$ are the corresponding feature weights, and $Z(x)$ is the normalization factor.
The method of determining the phenotype keywords in the clinical description information includes, for example: first, the computing device 110 fuses the first feature vector and the second feature vector to generate a fused feature vector. For example, for each segmented word or symbol, the computing device 110 fuses the first feature vector obtained via step 304 (e.g., the 768-dimensional feature vector) and the second feature vector obtained via step 306 (e.g., the 21-dimensional feature vector) to generate a fused feature vector, i.e., a new 789-dimensional feature vector.
The computing device 110 then extracts features of the fused feature vector via the third network layer to predict a category for each character. The method of predicting the category of each character includes, for example: reducing the dimension of the fused feature vector through a fully connected network layer, so that the dimension of the reduced feature vector matches the number of label categories; inputting the reduced feature vectors into the third network layer (e.g., a conditional random field model) to compute a log-likelihood value for each reduced feature vector; taking the negative average of the computed log-likelihood values as the loss value of the first neural network model; and decoding with the Viterbi algorithm to predict the category of each character. A sketch of the full model follows.
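A sketch of the full first neural network model (BERT character vectors + dictionary/N-gram features + fully connected dimension reduction + CRF), assuming PyTorch, Hugging Face transformers, and the pytorch-crf package; the checkpoint name and the five-tag BIEOS inventory are assumptions.

    import torch
    import torch.nn as nn
    from transformers import BertModel
    from torchcrf import CRF   # assumed: the pytorch-crf package

    NUM_TAGS = 5   # BIEOS: B, I, E, O, S

    class PhenotypeTagger(nn.Module):
        def __init__(self):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-chinese")
            self.fc = nn.Linear(768 + 21, NUM_TAGS)   # 789-d fused vector -> tags
            self.crf = CRF(NUM_TAGS, batch_first=True)

        def forward(self, input_ids, attention_mask, ngram_feats, tags=None):
            hidden = self.bert(input_ids,
                               attention_mask=attention_mask).last_hidden_state
            # Concatenate BERT vectors with the 21-d dictionary features.
            fused = torch.cat([hidden, ngram_feats], dim=-1)  # (batch, seq, 789)
            emissions = self.fc(fused)
            if tags is not None:
                # Negative mean log-likelihood as the loss value.
                return -self.crf(emissions, tags,
                                 mask=attention_mask.bool(), reduction="mean")
            # Viterbi decoding predicts a tag category for each character.
            return self.crf.decode(emissions, mask=attention_mask.bool())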
Thereafter, the computing device 110 determines the phenotype keywords in the clinical description information based on the predicted category of each character; i.e., the corresponding phenotype keywords of the whole clinical description are extracted. For example, as shown in FIG. 4, the whole segment of clinical description information 402 input into the model is "主诉：矮小" ("chief complaint: short stature"). The predicted categories 412 of the characters are, for example, "O", "O", "O", "B", "E". According to the predicted category of each character, the phenotype keyword 420 is determined to be "矮小" ("short stature").
This scheme can improve the efficiency of interpreting whole segments of clinical description information and reduce the subjective differences between different readers.
FIG. 6 illustrates a flow diagram of a method 600 for computing the similarity between two phenotypes with the second neural network model, in accordance with embodiments of the present disclosure. It should be appreciated that the method 600 may be performed, for example, at the electronic device 1100 depicted in FIG. 11, or at the computing device 110 depicted in FIG. 1. It should be understood that the method 600 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 602, the computing device 110 preprocesses the phenotype keyword in the clinical descriptive information and the phenotype of the corresponding syndrome, respectively, to generate a first phenotype input identifier and a second phenotype input identifier. For example, the computing device 110 segments the phenotype keyword in the clinical descriptive information into words or punctuation, then looks up the input identifier corresponding to each segmented word or punctuation mark in the pre-created character set of the BERT model, so as to generate a first phenotype input identifier (e.g., the first input 502 shown in FIG. 5) corresponding to the first phenotype. Likewise, the computing device 110 segments the phenotype of the corresponding syndrome into words or punctuation, and then looks up the corresponding input identifiers in the pre-created character set of the BERT model, so as to generate a second phenotype input identifier (e.g., the second input 512 of FIG. 5) corresponding to the second phenotype.
At step 604, the computing device 110 encodes the first phenotype input identifier and the second phenotype input identifier as a third feature vector and a fourth feature vector for input into the second neural network model. For example, the first and second phenotype input identifiers are encoded as 768-dimensional feature vectors via a first BERT network layer and a second BERT network layer, respectively.
At step 606, the computing device 110 averages the third feature vectors corresponding to all characters of the phenotype keyword in the clinical descriptive information to obtain a first corresponding code of the phenotype keyword. As shown in FIG. 5, the first phenotype, corresponding to a phenotype keyword in the clinical descriptive information, is input as the first input 502, and the second phenotype, corresponding to a phenotype of the syndrome, is input as the second input 512, into the first BERT network layer 504 and the second BERT network layer 514, respectively. The output of the first BERT network layer 504, after passing through the first pooling layer 506, forms the first corresponding code, i.e., the corresponding code 508 of the phenotype keyword in the clinical description.
At step 608, the computing device 110 averages the fourth feature vectors corresponding to all of the characters in the phenotype of the corresponding syndrome to obtain a second corresponding encoding of the phenotype of the corresponding syndrome. For example, the output of the second BERT network layer 514, via the second pooling layer 516, forms a second corresponding code 518 corresponding to the phenotype of the syndrome.
At step 610, the computing device 110 calculates the cosine of the angle between the first corresponding code and the second corresponding code in order to determine the similarity between the phenotype keyword in the clinical description and the phenotype of the corresponding syndrome. A sketch of steps 602 to 610 follows.
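The following PyTorch sketch illustrates steps 602 to 610 under the assumption of a HuggingFace-style BERT encoder whose weights are shared between the two branches, as is typical for a twin network; the checkpoint name and the input strings are illustrative, not those used in the disclosure.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# One shared encoder serves both branches, as is typical for a twin network.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode(text):
    inputs = tokenizer(text, return_tensors="pt")     # phenotype input identifiers
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1)                         # pooling: average over characters

first = encode("矮小")         # phenotype keyword from the clinical description
second = encode("身材矮小")    # phenotype of the candidate syndrome
print(float(torch.nn.functional.cosine_similarity(first, second)))
```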
By employing the above approach, the present disclosure can quickly and accurately determine the similarity between the two phenotypes, namely a phenotype keyword in the clinical descriptive information and a phenotype of the corresponding syndrome.
FIG. 7 shows a flow diagram of a training method 700 of the second neural network model, in accordance with an embodiment of the present disclosure. It should be appreciated that the method 700 may be performed, for example, at the electronic device 1100 depicted in FIG. 11, or at the computing device 110 depicted in FIG. 1. It should be understood that the method 700 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 702, the computing device 110 maps the standard term set of a predetermined database (e.g., without limitation, CHPO) into a multidimensional (e.g., 768-dimensional) space according to the similarities between the terms.
At step 704, the computing device 110 randomly draws two standard terms and calculates the similarity between them, using the calculated similarity as the training target value. For example, the CHPO tree similarity of the two standard terms is calculated and used as the training target value for training.
CHPO itself has a tree structure, and thus the similarity between phenotypes should respect this tree structure. FIG. 8 shows a schematic diagram of a CHPO tree structure 800 according to an embodiment of the disclosure. In the CHPO tree structure 800 shown in FIG. 8, the "Phenotypic abnormality" node is, for example, the primary node 810, whose depth is defined as 1 (depth = 1). Nodes with a depth of 2 (depth = 2) are, for example, the secondary nodes of the respective organ systems, such as the "Abnormality of the nervous system" node 820, the "Abnormality of the limbs" node 822, and the "Abnormality of the cardiovascular system" node 824; nodes with a depth greater than 2 (e.g., depth = 3, depth = 4) follow in the same manner. For example, the "Autism" node and the "Autistic behavior" node (not shown) are adjacent in the CHPO tree structure 800 and thus have high similarity, whereas the "Autism" node and the "Polycystic kidney dysplasia" node are far apart in the CHPO tree structure 800 and thus have low similarity.
A method for calculating the CHPO tree similarity between any two nodes i and j of the CHPO tree structure 800 is described below with reference to formula (6).
$$\mathrm{CHPO\_sim}_{ij} = 1 - \frac{\log(\mathrm{depth}_i) + \log(\mathrm{depth}_j) - 2\,\log(\mathrm{depth}_x)}{5.278} \qquad (6)$$
In the above equation (6), i denotes the i-th node and j denotes the j-th node; x denotes the deepest common parent node of node i and node j. depth_i, depth_j and depth_x denote the depths of node i, node j and node x, respectively. The constant 5.278 is twice log(14), where 14 is the depth of the deepest node in the tree. CHPO_sim_ij denotes the CHPO tree similarity between node i and node j. A minimal sketch of this calculation is given below.
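A minimal sketch of the tree similarity as reconstructed in equation (6); the node depths in the usage example are hypothetical.

```python
import math

def chpo_tree_similarity(depth_i, depth_j, depth_x):
    """depth_x: depth of the deepest common parent of nodes i and j.
    The distance through the common parent is normalised by the maximum
    possible distance 2*log(14) = 5.278 and turned into a similarity."""
    distance = math.log(depth_i) + math.log(depth_j) - 2 * math.log(depth_x)
    return 1.0 - distance / (2 * math.log(14))

# Two nodes sharing a deep common parent score high ...
print(chpo_tree_similarity(depth_i=5, depth_j=5, depth_x=4))   # ~0.92
# ... while nodes whose only common parent is near the root score low.
print(chpo_tree_similarity(depth_i=10, depth_j=9, depth_x=1))  # ~0.15
```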
At step 706, the computing device 110 performs supervised training of the second neural network using associated spoken-language expressions and the standard terms of a predetermined database (e.g., without limitation, CHPO), so as to generate the trained second neural network model.
A method of supervised training for the second neural network is described below in conjunction with FIG. 9. FIG. 9 shows a flow diagram of a method 900 for supervised training of the second neural network in accordance with an embodiment of the present disclosure. It should be appreciated that the method 900 may be performed, for example, at the electronic device 1100 depicted in FIG. 11, or at the computing device 110 depicted in FIG. 1. It should be understood that the method 900 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 902, the computing device 110 randomly generates a first random number and a second random number between 0 and 1.
At step 904, the computing device 110 determines whether the first random number is less than a first predetermined threshold.
At step 906, if the computing device 110 determines that the first random number is less than the first predetermined threshold, the training target value of the second neural network model is determined to be 1, with the spoken keyword serving as the first input to the second neural network model and the standard term serving as the second input, the standard term being obtained from a predetermined data set. The standard term is, for example, a CHPO standard term.
At step 908, if the computing device 110 determines that the first random number is greater than or equal to the first predetermined threshold, then, for the spoken keyword and the standard term, the training target value of the second neural network model is determined to be the predetermined-data-set tree similarity between the standard term labeled for the spoken keyword and the standard term serving as the second input, the predetermined data set tree including a plurality of nodes, each corresponding to one standard term. For example, if the first random number is greater than or equal to the first predetermined threshold, then, for the spoken keyword and the CHPO standard term, the training target value of the second neural network is determined to be the CHPO tree similarity between the standard term labeled for the spoken keyword and the CHPO standard term.
At step 910, the computing device 110 determines whether the second random number is less than a second predetermined threshold.
If the computing device 110 determines that the second random number is less than the second predetermined threshold, at step 912, a second neural network model is trained.
At step 914, if the computing device 110 determines that the second random number is greater than or equal to the second predetermined threshold, the inputs are swapped so that the spoken keyword serves as the second input to the second neural network model and the standard term serves as the first input; the method then jumps to step 912 to train the second neural network model. A control-flow sketch of the whole procedure follows.
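The following control-flow sketch mirrors steps 902 to 914. The helper names, the thresholds and the tree-similarity callback are assumptions; only the branching logic follows the description above.

```python
import random

def make_training_example(spoken_keyword, labeled_term, random_term,
                          tree_similarity, thr1=0.5, thr2=0.5):
    """Assemble one (first_input, second_input, target) training triple."""
    r1, r2 = random.random(), random.random()   # step 902
    if r1 < thr1:                               # steps 904-906
        second, target = labeled_term, 1.0      # keyword paired with its own term
    else:                                       # step 908: target is the tree
        second = random_term                    # similarity between the labelled
        target = tree_similarity(labeled_term, random_term)  # and random term
    first = spoken_keyword
    if r2 >= thr2:                              # step 914: swap the two inputs
        first, second = second, first
    return first, second, target                # used to train the model (step 912)
```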
In the above scheme, the unsupervised training based on standard terms and the supervised training based on associated spoken-language expressions and standard terms allow the similarity between phenotype keywords in the clinical descriptive information, whether expressed colloquially or in standard terminology, and the phenotypes of the corresponding syndromes to be calculated more accurately.
The method of ranking candidate genes will be described below with reference to FIG. 10. FIG. 10 shows a flow diagram of a method 1000 for ranking candidate genes according to an embodiment of the disclosure. It should be appreciated that the method 1000 may be performed, for example, at the electronic device 1100 depicted in FIG. 11, or at the computing device 110 depicted in FIG. 1. It should be understood that the method 1000 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.
At step 1002, the computing device 110 determines a similarity matrix for each corresponding syndrome based on the plurality of phenotypes of that syndrome and the plurality of phenotype keywords extracted from the clinical descriptive information, the similarity matrix indicating the similarities between the phenotypes of the corresponding syndrome and the phenotype keywords in the clinical descriptive information. For example, if m corresponding CHPO phenotypes are identified for candidate syndrome i and n phenotype keywords can be extracted from the clinical descriptive information, a similarity matrix is constructed for candidate syndrome i following expression (7).
$$\mathrm{matrix}_i = \begin{pmatrix} \mathrm{sim}(1,1) & \mathrm{sim}(1,2) & \cdots & \mathrm{sim}(1,n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{sim}(m,1) & \mathrm{sim}(m,2) & \cdots & \mathrm{sim}(m,n) \end{pmatrix} \qquad (7)$$
In the above expression (7), matrix_i denotes the similarity matrix constructed for candidate syndrome i; m indexes the m-th corresponding CHPO phenotype of candidate syndrome i, and n indexes the n-th phenotype keyword in the clinical descriptive information. sim(m, n) denotes the similarity between the m-th corresponding CHPO phenotype of candidate syndrome i and the n-th phenotype keyword in the clinical descriptive information.
At step 1004, the computing device 110 determines an evaluation value for each corresponding syndrome based on the similarity matrix for each corresponding syndrome. The calculation method of the evaluation value of the candidate syndrome is described below with reference to formula (8).
$$\mathrm{syndrome\_score}_i = \mathrm{average}\big(\max(\mathrm{matrix}_i,\ \mathrm{axis}=1)\big) \qquad (8)$$
In the above expression (8), syndrome_score_i denotes the evaluation value calculated for the corresponding syndrome i, and matrix_i denotes the similarity matrix constructed for candidate syndrome i. axis = 1 indicates that the max operation is taken across the columns of each row, i.e., each syndrome phenotype is matched with its best-fitting clinical keyword; average() denotes the averaging operation and max() the maximum operation.
At step 1006, the computing device 110 determines the evaluation value of each candidate gene based on the evaluation values of its corresponding syndromes, so as to rank the candidate genes based on the determined evaluation values. For example, if candidate gene x corresponds to n syndromes x1, x2, ..., xn, the evaluation value of the candidate gene is calculated as described below with reference to formula (9).
$$\mathrm{gene\_score}_x = \max\big(\mathrm{syndrome\_score}_{x1},\ \mathrm{syndrome\_score}_{x2},\ \ldots,\ \mathrm{syndrome\_score}_{xn}\big) \qquad (9)$$
In the above expression (9), gene_score_x denotes the evaluation value calculated for candidate gene x; x1, x2, ..., xn denote the n corresponding syndromes of candidate gene x, and syndrome_score_x1, ..., syndrome_score_xn denote their evaluation values. max() takes the maximum among the evaluation values of the n corresponding syndromes. A numpy sketch tying expressions (7) to (9) together follows.
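The following numpy sketch ties expressions (7) to (9) together: build the m x n similarity matrix per candidate syndrome, reduce it to a syndrome score, and score each gene by its best syndrome. The phenotype_similarity callback stands in for the second neural network model, and the gene and phenotype data are hypothetical.

```python
import numpy as np

def syndrome_score(syndrome_phenotypes, clinical_keywords, phenotype_similarity):
    matrix = np.array([[phenotype_similarity(p, k) for k in clinical_keywords]
                       for p in syndrome_phenotypes])   # expression (7), m x n
    return matrix.max(axis=1).mean()                    # expression (8)

def gene_score(gene_syndromes, clinical_keywords, phenotype_similarity):
    return max(syndrome_score(s, clinical_keywords, phenotype_similarity)
               for s in gene_syndromes)                 # expression (9)

# Hypothetical data; in the disclosure the similarity comes from the
# trained second neural network model.
sim = lambda a, b: 1.0 if a == b else 0.2
genes = {"GENE_A": [["short stature", "seizure"]],      # one syndrome per gene here
         "GENE_B": [["polydactyly"]]}
keywords = ["short stature"]
ranked = sorted(genes, key=lambda g: gene_score(genes[g], keywords, sim),
                reverse=True)
print(ranked)   # -> ['GENE_A', 'GENE_B']
```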
FIG. 11 schematically shows a block diagram of an electronic device 1100 suitable for implementing embodiments of the present disclosure. The device 1100 may be used to implement the methods 200, 300, 600, 700, 900 and 1000 shown in FIGS. 2, 3, 6, 7, 9 and 10. As shown in FIG. 11, the device 1100 includes a CPU 1101, which may perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 1102 or loaded from a storage unit 1108 into a random access memory (RAM) 1103. The RAM 1103 may also store various programs and data necessary for the operation of the device 1100. The CPU 1101, the ROM 1102 and the RAM 1103 are connected to one another by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
A number of components in the device 1100 are connected to the I/O interface 1105, including an input unit 1106, an output unit 1107, a storage unit 1108 and a communication unit 1109. The CPU 1101 performs the respective methods and processes described above, for example, the methods 200, 300, 600, 700, 900 and 1000. For example, in some embodiments, the methods 200, 300, 600, 700, 900 and 1000 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When loaded into the RAM 1103 and executed by the CPU 1101, the computer program may perform one or more operations of the methods 200, 300, 600, 700, 900 and 1000 described above. Alternatively, in other embodiments, the CPU 1101 may be configured by any other suitable means (e.g., by means of firmware) to perform one or more acts of the methods 200, 300, 600, 700, 900 and 1000.
It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++ as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA) or a programmable logic array (PLA), can be personalized by utilizing the state information of the computer-readable program instructions; such electronic circuitry may execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above are merely alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (13)

1. A method for determining disease-risk variant sites, comprising:
acquiring alignment result information of whole-exome sequencing sequences of a sample of a subject to be tested, and clinical descriptive information about the subject to be tested;
determining variant sites based on the whole-exome sequencing sequence alignment result information, and annotating the variant sites so as to generate annotation information about the variant sites, wherein the annotation information about the variant sites at least comprises: gene function annotation information and gene-related disease information;
extracting, via a trained first neural network model, phenotype keywords in the clinical descriptive information, the first neural network model comprising a first network layer constructed from a BERT model, a second network layer constructed from an N-gram model, and a third network layer constructed based on a conditional random field model;
ranking the candidate genes based on the extracted phenotypic keywords to generate ranking information about the candidate genes associated with the clinical descriptive information;
filtering the variant sites based on annotated information about the variant sites;
generating, based on the annotation information of the variant sites remaining after the filtering and the ranking information about the candidate genes, input data for input into a trained predetermined model; and
extracting features of the input data based on the trained predetermined model to determine disease-risk variant sites of the subject to be tested,
wherein extracting phenotypic keywords in the clinical descriptive information via the trained first neural network model comprises:
segmenting each sentence in the clinical descriptive information into words or punctuation, so as to convert the segmented words or punctuation into corresponding input identifiers;
converting each corresponding input identifier into a multidimensional first feature vector based on the first network layer;
generating a second feature vector based on the second network layer and a predetermined keyword set, the second feature vector being used for indicating whether a phrase consisting of each word and surrounding words belongs to the predetermined keyword set; and
determining, via the third network layer, a phenotype keyword in the clinical description information based on the first feature vector and the second feature vector.
2. The method of claim 1, wherein determining a phenotypic keyword in the clinical description information comprises:
fusing the first feature vector and the second feature vector to generate a fused feature vector;
extracting, via the third network layer, features of the fused feature vector to predict a category for each character; and
determining a phenotype keyword in the clinical description information based on the predicted category of each character.
3. The method of claim 2, wherein extracting features of the fused feature vector via the third network layer to predict a category for each character comprises:
reducing the dimension of the fused feature vector through a fully connected network layer, so that the dimension of the reduced feature vector matches the number of label categories;
inputting the feature vectors subjected to the dimension reduction into the third network layer so as to calculate the log likelihood value of each feature vector subjected to the dimension reduction;
taking a negative average of the calculated log-likelihood values as a loss value of the first neural network model; and
performing decoding using the Viterbi algorithm to predict the category of each character.
4. The method of claim 1, wherein ranking the candidate genes based on the extracted phenotypic keywords to generate ranking information about the candidate genes comprises:
identifying the candidate gene and the corresponding syndrome;
extracting a phenotype for each corresponding syndrome;
calculating similarities of the phenotype keywords in the clinical descriptive information to the phenotype of the corresponding syndrome via a second neural network model, the second neural network model being constructed based on a twin neural network formed by a BERT model; and
based on the calculated similarity, the candidate genes are ranked so as to generate ranking information on the candidate genes.
5. The method of claim 4, wherein calculating, via a second neural network model, similarities of phenotype keywords in the clinical description information to phenotypes of corresponding syndromes comprises:
respectively preprocessing the phenotype keywords in the clinical description information and the phenotype of the corresponding syndrome so as to generate a first phenotype input identifier and a second phenotype input identifier;
encoding the first phenotype input identifier and the second phenotype input identifier as a third feature vector and a fourth feature vector for input into a second neural network model;
averaging the third feature vectors corresponding to all characters in the phenotype keywords in the clinical description information so as to obtain a first corresponding code of the phenotype keywords in the clinical description information;
averaging the fourth feature vectors corresponding to all the characters in the phenotype corresponding to the syndrome so as to obtain a second corresponding code corresponding to the phenotype of the syndrome; and
calculating the cosine of the included angle of the first corresponding code and the second corresponding code so as to determine the phenotype similarity of the phenotype keywords in the clinical description information and the corresponding syndrome.
6. The method of claim 4, wherein the training method of the second neural network model comprises:
mapping a standard term set of a predetermined database into a multidimensional space according to the similarities between the standard terms;
randomly extracting two standard terms so as to calculate the similarity between the two standard terms, and using the calculated similarity as a training target value for training; and
performing supervised training of the second neural network model using associated spoken-language expressions and standard terms of a predetermined database, so as to generate the trained second neural network model.
7. The method of claim 6, wherein performing supervised training of the second neural network model using associated spoken-language expressions and standard terms of a predetermined database comprises:
randomly generating a first random number and a second random number between 0 and 1;
determining whether the first random number is less than a first predetermined threshold;
in response to determining that the first random number is less than the first predetermined threshold, determining a training target value of the second neural network model to be 1 for the spoken keyword as a first input to the second neural network model and a standard term as a second input to the second neural network model, the standard term obtained from a predetermined set of data;
responsive to determining that the first random number is greater than or equal to a first predetermined threshold, for the spoken keyword and the standard term, determining a training target value of a second neural network model as a predetermined data set tree similarity between the standard term labeled by the spoken keyword and the standard term as the second input, the predetermined data set tree including a plurality of nodes, each node corresponding to one standard term;
determining whether the second random number is less than a second predetermined threshold;
training the second neural network model in response to determining that the second random number is less than a second predetermined threshold; and
in response to determining that the second random number is greater than or equal to the second predetermined threshold, causing the spoken keyword to be used as the second input to the second neural network model and the standard term to be used as the first input, for training the second neural network model.
8. The method of claim 4, wherein ranking candidate genes based on the calculated similarities comprises:
determining a similarity matrix for the corresponding syndrome based on the plurality of phenotypes corresponding to the corresponding syndrome and the plurality of phenotype keywords extracted from the clinical descriptive information, the similarity matrix indicating similarities between the phenotype corresponding to the corresponding syndrome and the phenotype keywords in the clinical descriptive information;
determining an evaluation value for each corresponding syndrome based on the similarity matrix for each corresponding syndrome; and
based on the evaluation values of the corresponding syndromes, evaluation values of the candidate genes are determined so as to rank the candidate genes based on the determined evaluation values of the candidate genes.
9. The method of claim 1, wherein the annotation information for a variant site further comprises population frequency information.
10. The method of claim 9, wherein filtering variant sites based on annotation information about the variant sites comprises:
filtering out benign and possibly benign variant sites based on annotation information about variant sites;
filtering out variant sites in response to determining that the population frequency information for the variant sites falls within a predetermined frequency threshold range and that the variant sites are not annotated to relevant disease information of a first predetermined data set;
filtering out variant sites in response to determining that gene-related disease information for the variant sites is not annotated to related disease information of a second predetermined data set; and
filtering out the variant sites in response to determining that the genes indicated by the gene function annotation information for the variant sites belong to a predetermined range of genes of a third predetermined data set.
11. The method of claim 1, wherein generating the input data based on the annotation information of the variant sites remaining after filtering and the ranking information about the candidate genes comprises:
fusing the annotation information of the variant sites remaining after filtering and the ranking information about the candidate genes;
converting the fused annotation information of the variant sites and ranking information of the candidate genes into feature vectors; and
performing normalization processing on the converted feature vectors to generate the input data.
12. A computing device, comprising:
at least one processing unit;
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit causing the computing device to perform the method of any of claims 1-11.
13. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-11.