CN115952304B

CN115952304B - Method, device, equipment and storage medium for retrieving variant literature

Info

Publication number: CN115952304B
Application number: CN202310232304.XA
Authority: CN
Inventors: 蔡娇; 许青青; 盛磊; 陈梅; 任子云; 余蕾; 方云倩; 张学杰; 徐昕; 苗翠翠; 王建峰
Original assignee: Suzhou Chaoyun Life Intelligence Industry Research Institute Co ltd
Current assignee: Suzhou Chaoyun Life Intelligence Industry Research Institute Co ltd
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-05-30
Anticipated expiration: 2043-03-13
Also published as: CN115952304A

Abstract

The invention discloses a method, a device, equipment and a storage medium for retrieving variant documents. The method comprises the following steps: constructing at least one search combination set based on at least one search entity in the received search data; determining at least one reference variation document corresponding to the search combination and a search weight value corresponding to each reference variation document respectively based on a pre-constructed document knowledge graph for each search combination in the search combination set; based on each retrieval weight value, sequencing each reference variation document to obtain a reference sequencing result corresponding to the retrieval combination; determining a document retrieval result corresponding to the retrieval data based on each reference ranking result; the literature knowledge graph comprises at least one association relation corresponding to at least one preset variation literature and at least one variation type respectively and a preset weight value corresponding to each association relation respectively, and the variation type represents a combination form of at least one preset entity. The embodiment of the invention improves the efficiency of document retrieval.

Description

Method, device, equipment and storage medium for retrieving variant literature

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for retrieving a variant document.

Background

With the completion of the Human Genome Project in 2003, human understanding of our own genetic information made a qualitative leap. Genetic testing has become a hotspot in clinical diagnostics and scientific research, and the PubMed database records a growing body of new literature on human genetic variation, including thousands of documents that describe the pathogenic potential of various mutation sites.

When interpreting a mutation site, a researcher has to consult a large number of mutation databases. A traditional variant search engine uses the co-occurrence of several search words in a document as its search criterion and presents all of the related documents to the researcher together; the researcher must then locate the article fragments in which the search words appear and judge whether each fragment can serve as a reference. Therefore, the retrieval results given by a traditional variant search engine have poor reference value, and document retrieval efficiency is low.

Disclosure of Invention

The embodiment of the invention provides a retrieval method, a device, equipment and a storage medium of a variant document, which are used for solving the problem of poor referenceability of a retrieval result given by a traditional variant retrieval engine and improving the document retrieval efficiency.

According to one embodiment of the present invention, there is provided a method for retrieving a variant document, including:

in response to receiving the search data, constructing a search combination set based on at least one search entity in the search data;

determining at least one reference variation document corresponding to the search combination and a search weight value respectively corresponding to each reference variation document based on a pre-constructed document knowledge graph for each search combination in the search combination set;

based on the retrieval weight values, sequencing the reference mutation documents to obtain a reference sequencing result corresponding to the retrieval combination;

determining a document retrieval result corresponding to the retrieval data based on the reference sorting results respectively corresponding to the retrieval combinations;

the literature knowledge graph comprises association relations respectively corresponding to at least one preset variation literature and at least one variation type and preset weight values respectively corresponding to the association relations, and the variation type represents a combination form of at least one preset entity.

According to another embodiment of the present invention, there is provided a device for retrieving a variant document, including:

a search combination set construction module for constructing a search combination set based on at least one search entity in search data in response to receiving the search data;

a reference mutation document determination module configured to determine, for each search combination in the search combination set, at least one reference mutation document corresponding to the search combination and a search weight value corresponding to each of the reference mutation documents, based on a document knowledge graph constructed in advance;

the reference ranking result determining module is used for ranking the reference mutation documents based on the retrieval weight values to obtain reference ranking results corresponding to the retrieval combinations;

a document retrieval result determining module for determining a document retrieval result corresponding to the retrieval data based on the reference ranking results respectively corresponding to the retrieval combinations;

According to another embodiment of the present invention, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method for retrieving variant documents according to any of the embodiments of the present invention.

According to another embodiment of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute the method for searching for a variant document according to any of the embodiments of the present invention.

According to the technical scheme of the embodiment of the invention, a literature knowledge graph is constructed in advance; it contains the association relations between at least one preset variation document and at least one variation type, together with the preset weight value corresponding to each association relation. A search combination set is constructed based on at least one search entity in the received search data. For each search combination in the search combination set, at least one reference variation document corresponding to the search combination, and the search weight value corresponding to each reference variation document, are determined based on the pre-constructed literature knowledge graph, and the reference variation documents are ranked based on the search weight values to obtain the reference ranking result corresponding to the search combination. The document retrieval result corresponding to the search data is then determined based on the reference ranking results. This solves the problem that the retrieval results given by a traditional variant search engine have poor reference value: variant documents that better meet the user's retrieval needs are ranked first, which improves the efficiency of document retrieval.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a method for searching variant documents according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a knowledge graph of a document according to an embodiment of the present invention;

FIG. 3 is a flowchart of another method for searching variant documents according to an embodiment of the present invention;

FIG. 4 is a flowchart of a specific example of a method for identifying and aligning preset entities according to an embodiment of the present invention;

FIG. 5 is a flowchart of another method for searching variant documents according to an embodiment of the present invention;

FIG. 6 is a flow chart of a method for determining a search combination set according to an embodiment of the present invention;

FIG. 7 is a flowchart of a method for determining a search variation set according to an embodiment of the present invention;

FIG. 8 is a flowchart of another method for searching variant documents according to an embodiment of the present invention;

FIG. 9 is a flowchart of another method for determining a search combination set according to an embodiment of the present invention;

FIG. 10 is a flowchart of a specific example of a method for searching for variant documents according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a device for searching a variant document according to an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," "initial," "target," "reference," "preset," "retrieve," and the like in the description and claims of the present invention and in the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Fig. 1 is a flowchart of a method for searching a variant document according to an embodiment of the present invention, where the method may be performed by a variant document searching device, and the variant document searching device may be implemented in hardware and/or software, and the variant document searching device may be configured in a terminal device. As shown in fig. 1, the method includes:

S110, in response to receiving the search data, constructing a search combination set based on at least one search entity in the search data.

In this embodiment, each search entity in the search data includes at least a search gene, and on this basis, each search entity in the search data may further include a search amino acid, a search nucleotide, a search transcript, a search disease, and the like.

Wherein, specifically, the search combination set comprises at least one search combination, and the search combination comprises at least one search entity.

In an alternative embodiment, the search data includes at least a search gene, and correspondingly, constructing a search combination set based on at least one search entity in the search data includes: when the search data further includes a search disease, adding the search gene and the search disease as a search combination to a search combination set; when the search data further includes a search amino acid and a search nucleotide, the search gene, the search amino acid, and the search nucleotide are added as a search combination to the search combination set.

For example, assuming that the retrieved data includes gene 1, disease 1, amino acid 1, and nucleotide 1, the retrieved combination set comprises [ gene 1 disease 1] and [ gene 1 amino acid 1 nucleotide 1].
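The construction rule of S110 can be illustrated with a minimal sketch. The function name `build_search_combinations` and its parameter names are hypothetical and are not part of the embodiment; the sketch only assumes that the search gene is always present and that the other search entities are optional.

```python
from typing import List, Optional, Tuple


def build_search_combinations(
    gene: str,
    disease: Optional[str] = None,
    amino_acid: Optional[str] = None,
    nucleotide: Optional[str] = None,
) -> List[Tuple[str, ...]]:
    """Build the search combination set from the search entities of one query."""
    combinations: List[Tuple[str, ...]] = []
    # A search gene plus a search disease forms one search combination.
    if disease is not None:
        combinations.append((gene, disease))
    # A search gene plus a search amino acid and a search nucleotide forms another.
    if amino_acid is not None and nucleotide is not None:
        combinations.append((gene, amino_acid, nucleotide))
    return combinations


combos = build_search_combinations(
    "gene 1", disease="disease 1", amino_acid="amino acid 1", nucleotide="nucleotide 1"
)
print(combos)
```

With all four search entities supplied, the set contains the two combinations of the example above; with only a search gene, this sketch returns an empty set.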

S120, determining at least one reference variation document corresponding to the search combination and a search weight value corresponding to each reference variation document respectively based on a pre-constructed document knowledge graph for each search combination in the search combination set.

In this embodiment, the literature knowledge graph includes at least one association relationship corresponding to at least one preset variation literature and at least one variation type, and a preset weight value corresponding to each association relationship, where the variation type represents a combination form of at least one preset entity.

In an alternative embodiment, each variation type in the literature knowledge graph includes GD variation and/or GPN variation. Wherein, the GD variation characterizes a combination of a predetermined gene and a predetermined disease, and the GPN variation characterizes a combination of a predetermined gene, a predetermined amino acid, and a predetermined nucleotide. Illustratively, the variant names of the GD variant and the GPN variant may be composed of the entity names of the corresponding preset entities, e.g., the variant name of the GPN variant may be "GJB2:p.Lys224Gln:c.670A>C".

Fig. 2 is a schematic diagram of a document knowledge graph according to an embodiment of the present invention. Specifically, the preset variation documents in the document knowledge graph shown in Fig. 2 are variation document 1, variation document 2, and variation document 3. Here, "G" represents a gene, "D" represents a disease, "P" represents an amino acid (protein), and "N" represents a nucleotide; the number following each letter identifies an instance of the corresponding preset entity; "Gx+Dy" represents a GD variation and "Gx+Py+Nz" represents a GPN variation; a dotted arrow represents an association relation, and the number on a dotted arrow is the preset weight value corresponding to that association relation. In Fig. 2, "Gx+Dy" and "Gx+Py+Nz" are used only to distinguish the variation types; in an actual literature knowledge graph they may be replaced by actual variation names.

In an alternative embodiment, determining at least one reference variation document corresponding to the search combination and a search weight value corresponding to each reference variation document respectively based on a pre-constructed document knowledge graph includes: determining a search variation type based on the search combination; determining at least one reference variation document and a retrieval weight value corresponding to each reference variation document based on the document knowledge graph and the retrieval variation type; the search mutation type is search GD mutation or search GPN mutation.

In one embodiment, determining the search variation type based on the search combination includes: when the search combination set contains the search combination composed of the search gene and the search disease, determining a search GD variation based on the search combination; when the search combination set includes a search combination of a search gene, a search amino acid, and a search nucleotide, a search GPN variation is determined based on the search combination.

For example, when the search combination is [ gene 1 disease 1], the search GD variation is G1+D1, and when the search combination is [ gene 1 amino acid 1 nucleotide 1], the search GPN variation is G1+P1+N1.

In an alternative embodiment, determining at least one reference variation document and a retrieval weight value corresponding to each reference variation document based on the document knowledge graph and the retrieval variation type includes: taking a preset variation document which is related to the retrieval variation type in the document knowledge graph as a reference variation document, and taking a preset weight value corresponding to the association relation as a retrieval weight value corresponding to the reference variation document.

Taking Fig. 2 as an example, assuming that the search variation type is G1+D1, the reference variation documents are document 1 and document 2, and the search weight values corresponding to document 1 and document 2 are 0.51 and 0.9, respectively. Assuming that the search variation type is G1+P1+N1, the reference variation documents are document 2 and document 3, and the search weight values corresponding to document 2 and document 3 are 0.46 and 0.5, respectively.
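As a minimal sketch of this lookup, the literature knowledge graph can be modelled as a mapping from variation name to the associated documents and their preset weight values. The dictionary layout, the `"+"`-joined naming and the function name `lookup_references` are assumptions made for illustration; only the four associations and weights stated in the example above are included.

```python
from typing import Dict, Tuple

# Toy graph: variation name -> {preset variation document: preset weight value}.
# Only the associations stated in the example above are modelled here.
KNOWLEDGE_GRAPH: Dict[str, Dict[str, float]] = {
    "G1+D1": {"document 1": 0.51, "document 2": 0.9},
    "G1+P1+N1": {"document 2": 0.46, "document 3": 0.5},
}


def lookup_references(
    graph: Dict[str, Dict[str, float]], combination: Tuple[str, ...]
) -> Dict[str, float]:
    """Return the reference documents and search weight values for one combination."""
    # Assumed naming convention: entity identifiers joined with "+".
    variation_name = "+".join(combination)
    # Documents associated with the variation become reference documents, and the
    # preset weight of each association becomes the search weight value.
    return dict(graph.get(variation_name, {}))


gd_refs = lookup_references(KNOWLEDGE_GRAPH, ("G1", "D1"))
gpn_refs = lookup_references(KNOWLEDGE_GRAPH, ("G1", "P1", "N1"))
print(gd_refs, gpn_refs)
```

A search variation that has no association in the graph simply yields no reference documents in this sketch.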

S130, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

Specifically, for each search combination, the reference mutation documents are ranked based on the search weight value corresponding to each reference mutation document corresponding to the search combination, so as to obtain the reference ranking result corresponding to the search combination.

Taking the above example as an example, the reference ranking result corresponding to [ gene 1 disease 1] is [ document 2 document 1], and the reference ranking result corresponding to [ gene 1 amino acid 1 nucleotide 1] is [ document 3 document 2].
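The ranking in S130 amounts to a descending sort on the search weight value. The following sketch uses a hypothetical function name, `rank_references`, and takes the weights of the example above as input.

```python
from typing import Dict, List


def rank_references(weights: Dict[str, float]) -> List[str]:
    """Rank reference documents by search weight value, highest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [document for document, _ in ranked]


gd_ranking = rank_references({"document 1": 0.51, "document 2": 0.9})
gpn_ranking = rank_references({"document 2": 0.46, "document 3": 0.5})
print(gd_ranking, gpn_ranking)
```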

S140, determining a document retrieval result corresponding to the retrieval data based on the reference sorting results respectively corresponding to the retrieval combinations.

In an alternative embodiment, determining the document retrieval result corresponding to the retrieval data based on the reference ranking results respectively corresponding to the search combinations includes: when the search combination set contains one search combination, taking the reference ranking result corresponding to that search combination as the document retrieval result corresponding to the retrieval data; when the search combination set contains at least two search combinations, acquiring a target ranking result from each reference ranking result based on a preset ranking number, ordering the target ranking results based on the priorities corresponding to the search combinations, and taking the ordered result as the document retrieval result corresponding to the retrieval data.

Illustratively, the preset ranking number is 100; this is not limited herein and may be set by the user according to actual requirements. For example, when the reference ranking result is in descending order, the first 100 reference variation documents in the reference ranking result are taken as the target ranking result.

In an alternative embodiment, the search combinations of the search genes and the search diseases have a lower priority than the search combinations of the search genes, the search amino acids, and the search nucleotides.

For example, assuming that reference ranking result 1 is determined based on a search combination composed of a search gene and a search disease, that reference ranking result 2 is determined based on a search combination composed of a search gene, a search amino acid, and a search nucleotide, that reference ranking result 1 is [ document 1 document 2], and that reference ranking result 2 is [ document 3 document 4 document 5], then the document search result is [ document 3 document 4 document 5 document 1 document 2].
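The merging rule can be sketched as follows, under the assumption that each reference ranking result is keyed by its combination kind ("GD" or "GPN") and that a smaller number in the hypothetical `PRIORITY` table means a higher priority. The names `merge_rankings` and `top_n` are illustrative only.

```python
from typing import Dict, List

# Assumed encoding of the priority rule: the gene + amino acid + nucleotide
# combination (GPN) outranks the gene + disease combination (GD).
PRIORITY: Dict[str, int] = {"GPN": 0, "GD": 1}


def merge_rankings(rankings: Dict[str, List[str]], top_n: int = 100) -> List[str]:
    """Combine per-combination reference rankings into one document search result."""
    # With a single search combination, its reference ranking is used as is.
    if len(rankings) == 1:
        return list(next(iter(rankings.values())))
    merged: List[str] = []
    # Otherwise take the first top_n documents of each ranking (the target ranking
    # result) and concatenate them in priority order.
    for kind in sorted(rankings, key=lambda k: PRIORITY[k]):
        merged.extend(rankings[kind][:top_n])
    return merged


result = merge_rankings(
    {
        "GD": ["document 1", "document 2"],
        "GPN": ["document 3", "document 4", "document 5"],
    }
)
print(result)
```

The sketch does not remove documents that appear in more than one ranking; it only orders and truncates.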

In an alternative embodiment, the number of combinations of the search combinations in the search combination set is at least two, and accordingly, before determining the document search result corresponding to the search data based on the reference ranking result respectively corresponding to each search combination, the method further includes: when at least one repeated variation document exists in each reference ranking result, deleting the repeated variation document from the reference ranking results corresponding to the retrieval combination with lower priority for each repeated variation document, and obtaining the screened reference ranking result.

Wherein, a repeated variation document is a reference variation document that appears in at least two reference ranking results, that is, a reference variation document that corresponds to at least two search combinations. Deleting the repeated variation document from the reference ranking results corresponding to the search combinations with lower priority means that the repeated variation document is retained in the reference ranking result with the highest priority and deleted from the other reference ranking results.

For example, assuming that reference ranking result 1 is determined based on a search combination composed of a search gene and a search disease, that reference ranking result 2 is determined based on a search combination composed of a search gene, a search amino acid, and a search nucleotide, that reference ranking result 1 is [ document 1 document 2 document 3], and that reference ranking result 2 is [ document 3 document 4 document 5], then document 3 is deleted from reference ranking result 1, and the corresponding document search result is [ document 3 document 4 document 5 document 1 document 2].

The advantage of this arrangement is that the same variant document does not appear more than once in the document retrieval result, which improves the accuracy of the document retrieval result.
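A minimal sketch of this screening step is given below. It assumes that the reference ranking results are passed in priority order, highest priority first; the function name `drop_repeats` is hypothetical.

```python
from typing import List, Set


def drop_repeats(rankings_by_priority: List[List[str]]) -> List[List[str]]:
    """Keep each repeated document only in the highest-priority ranking."""
    seen: Set[str] = set()
    screened: List[List[str]] = []
    for ranking in rankings_by_priority:
        # Drop every document already kept by a higher-priority ranking.
        kept = [document for document in ranking if document not in seen]
        seen.update(kept)
        screened.append(kept)
    return screened


screened = drop_repeats(
    [
        ["document 3", "document 4", "document 5"],  # gene + amino acid + nucleotide
        ["document 1", "document 2", "document 3"],  # gene + disease (lower priority)
    ]
)
print(screened)
```

Concatenating the screened rankings in the same order gives the document search result of the example above.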

In another alternative embodiment, determining a document search result corresponding to search data based on reference ranking results respectively corresponding to each search combination includes: respectively acquiring target variation documents with preset sequencing quantity from each reference sequencing result, and acquiring matching fragments corresponding to the retrieval combination in each target variation document; inputting the matching segments corresponding to the target mutation documents and the search combination into a pre-trained evidence classification model aiming at each target mutation document to obtain the matching probability corresponding to the target mutation document; and sorting all the target variant documents in a descending order based on all the matching probabilities to obtain document retrieval results corresponding to the retrieval combinations.

Illustratively, the preset ranking number is 100; this is not limited herein and may be set by the user according to actual requirements. Specifically, a matching segment is a document segment in which the search entities of the search combination appear.
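The re-ranking by matching probability can be sketched independently of any particular model. In the sketch below, `toy_score` is a deliberately simple stand-in for the pre-trained evidence classification model (it is not that model), and the function name `rerank_by_evidence` and the sample fragments are invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_by_evidence(
    targets: List[Tuple[str, str]],
    combination: Sequence[str],
    score: Callable[[str, Sequence[str]], float],
) -> List[str]:
    """Sort (document, matching fragment) pairs by matching probability, highest first."""
    scored = [(score(fragment, combination), document) for document, fragment in targets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [document for _, document in scored]


def toy_score(fragment: str, combination: Sequence[str]) -> float:
    """Stand-in scorer: fraction of search entities mentioned in the fragment."""
    return sum(entity in fragment for entity in combination) / len(combination)


reranked = rerank_by_evidence(
    [
        ("document 1", "gene 1 was sequenced in the cohort"),
        ("document 2", "gene 1 variants were linked to disease 1"),
    ],
    ("gene 1", "disease 1"),
    toy_score,
)
print(reranked)
```

Passing the scorer as a parameter keeps the ordering logic separate from the classifier, so a real model could be substituted without changing the sort.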

In an alternative embodiment, the model architecture of the evidence classification model is the BioLinkBERT pre-trained model. The BioLinkBERT model takes the output of the last layer at the first token position (the CLS position) as the vector representation of the matching segment, and inputs the vector representation into a fully connected layer for classification, so as to obtain the matching probability. The activation function employed by the BioLinkBERT model is, illustratively, a sigmoid function.
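The classification head described here, a fully connected layer followed by a sigmoid, can be written out in plain Python to make the arithmetic explicit. The vector, weights and bias below are made-up toy values rather than trained parameters, the encoder that produces the CLS vector is not modelled, and a single output unit is assumed.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matching_probability(
    cls_vector: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Map a CLS vector to a matching probability with one linear unit and a sigmoid."""
    logit = sum(w * h for w, h in zip(weights, cls_vector)) + bias
    return sigmoid(logit)


# Toy values only: a 3-dimensional "CLS vector" and hand-picked parameters.
probability = matching_probability([0.2, -0.1, 0.4], [1.0, 2.0, 0.5], 0.0)
print(probability)
```

Here the logit is 0.2 - 0.2 + 0.2 = 0.2, so the probability lies slightly above 0.5.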

The method has the advantages that according to the service requirement of genetic interpretation, the importance of supporting evidence on the retrieval of the variant documents is considered, the precise sequencing of the retrieved variant documents is realized, and the document retrieval efficiency is further improved.

On the basis of the above embodiment, optionally, the method further includes: outputting the document search result together with the matching fragment corresponding to each target variant document in the document search result. This has the advantage of helping researchers perform genetic interpretation work more efficiently.

According to the technical scheme of this embodiment, a literature knowledge graph is constructed in advance; it contains at least one association relation between at least one preset variation document and at least one variation type, together with the preset weight value corresponding to each association relation. A search combination set is constructed based on at least one search entity in the received search data. For each search combination in the search combination set, at least one reference variation document corresponding to the search combination, and the search weight value corresponding to each reference variation document, are determined based on the pre-constructed literature knowledge graph, and the reference variation documents are ranked based on the search weight values to obtain the reference ranking result corresponding to the search combination. The document retrieval result corresponding to the search data is then determined based on the reference ranking results. This solves the problem that the retrieval results given by a traditional variant search engine have poor reference value: variant documents that better meet the user's retrieval needs are ranked first, which improves document retrieval efficiency.

Fig. 3 is a flowchart of another method for searching a variant document according to an embodiment of the present invention, where the method for constructing a knowledge graph in the above embodiment is further refined. As shown in fig. 3, the method includes:

S210, acquiring at least one mutation type associated with each preset mutation document in the preset mutation document set.

Specifically, the preset variation document set records a plurality of preset variation documents.

In an alternative embodiment, obtaining at least one mutation type associated with the preset mutation document includes: acquiring at least two preset entities corresponding to a preset variation document, and constructing at least one preset entity pair based on each preset entity; inputting the preset entity pairs and entity fragments corresponding to the preset entity pairs in the preset variation literature into a pre-trained relation extraction model aiming at each preset entity pair to obtain an output entity relation; and determining at least one mutation type associated with the preset mutation document based on the entity relation corresponding to each preset entity pair. Wherein each preset entity pair comprises at least one entity pair formed by preset genes and preset diseases and at least one entity pair formed by preset genes, preset amino acids and preset nucleotides.

In an alternative embodiment, obtaining at least two preset entities corresponding to the preset variation document includes: acquiring at least two preset entities corresponding to the preset variation document by using a variant entity recognition tool. The variant entity recognition tool may be, for example, the tmVar entity recognition tool, which is not limited herein.

In another alternative embodiment, obtaining at least two preset entities corresponding to the preset mutation document includes: and inputting the preset variation literature into a pre-trained entity recognition model to obtain at least two output preset entities.

Illustratively, the entity data of 10000 variant documents are first annotated with the tmVar entity recognition tool; the reference entity sets preliminarily annotated by tmVar are then imported into the doccano text annotation tool, and a professional manually checks the entities to obtain the standard entity sets respectively corresponding to the 10000 variant documents. The standard entity sets are used for training the entity recognition model.

In one embodiment, the model structure of the entity recognition model is the BERT + Efficient GlobalPointer model.

The advantage of this setting is that the BERT + Efficient GlobalPointer model identifies each entity as a whole, end to end, and therefore takes a more global view; compared with other model architectures, it also has far fewer model parameters, which reduces the risk of overfitting.

Specifically, the preset entity pairs are used for representing entity pairs formed by at least two preset entities in the preset variation literature. In this embodiment, each preset entity pair includes at least one entity pair composed of a preset gene and a preset disease and at least one entity pair composed of a preset gene, a preset amino acid, and a preset nucleotide.

Wherein, the preset entities have four entity types: gene, disease, amino acid, and nucleotide. For example, assuming that the preset entities are gene 1, disease 1, disease 2, amino acid 1, amino acid 2, and nucleotide 1, when the variation type includes GD variation, the preset entity pairs include gene 1 & disease 1 and gene 1 & disease 2, and when the variation type includes GPN variation, the preset entity pairs include gene 1 & amino acid 1, gene 1 & nucleotide 1, gene 1 & amino acid 2, amino acid 1 & nucleotide 1, and amino acid 2 & nucleotide 1.
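The pairing rule can be sketched as Cartesian products over the entity types. The function name `build_entity_pairs`, the dictionary keys, and the input entities (gene 1, disease 1, disease 2, amino acid 1, amino acid 2, nucleotide 1) are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def build_entity_pairs(entities: Dict[str, List[str]]) -> Dict[str, List[Pair]]:
    """Build candidate entity pairs for GD and GPN variations from typed entities."""
    genes = entities.get("gene", [])
    diseases = entities.get("disease", [])
    amino_acids = entities.get("amino_acid", [])
    nucleotides = entities.get("nucleotide", [])
    return {
        # GD variation: every gene paired with every disease.
        "GD": list(product(genes, diseases)),
        # GPN variation: gene & amino acid, gene & nucleotide, amino acid & nucleotide.
        "GPN": list(product(genes, amino_acids))
        + list(product(genes, nucleotides))
        + list(product(amino_acids, nucleotides)),
    }


pairs = build_entity_pairs(
    {
        "gene": ["gene 1"],
        "disease": ["disease 1", "disease 2"],
        "amino_acid": ["amino acid 1", "amino acid 2"],
        "nucleotide": ["nucleotide 1"],
    }
)
print(pairs["GD"])
print(pairs["GPN"])
```

For this input the sketch yields two GD pairs and five GPN pairs; the order within each list is an artifact of the sketch and carries no meaning.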

In this embodiment, the mutation type includes GD mutation and/or GPN mutation, wherein GD mutation characterizes a combination of a predetermined gene and a predetermined disease, and GPN mutation characterizes a combination of a predetermined gene, a predetermined amino acid, and a predetermined nucleotide.

The entity fragment is specifically used for representing a document fragment recorded with two preset entities in a preset entity pair.

In an alternative embodiment, the model framework of the relation extraction model is the PURE model, a pipeline method. The encoder in the PURE model may adopt a BERT model, where the BERT model is used to encode the preset entity pair and the entity fragment, and the encoded vector is input into a linear transformation layer, which outputs the entity relationship of the preset entity pair based on the encoded vector. The linear transformation layer is, for example, LayerNorm + Dropout + classifier.

The entity relationship may specifically represent whether there is a relationship between two preset entities in a preset entity pair, and of course, the entity relationship may also represent a relationship type between two preset entities in the preset entity pair. In this embodiment, for example, when the preset entity pair is an entity pair constituted by a preset gene and a preset disease, the entity relationship may be "related disease", when the preset entity pair is a preset gene and a preset amino acid, the entity relationship may be "amino acid variation", when the preset entity pair is a preset gene and a preset nucleotide, the entity relationship may be "nucleotide variation", and when the preset entity pair is a preset amino acid and a preset nucleotide, the entity relationship may be "nucleotide variation".

In an alternative embodiment, determining at least one mutation type associated with the preset mutation document based on the entity relationship corresponding to each preset entity pair includes: when the preset entity pair is an entity pair formed by a preset gene and a preset disease, constructing a GD variation associated with the preset variation document based on the preset entity pair under the condition that the entity relationship of the preset entity pair indicates that a relationship exists; when the preset entity pairs comprise the entity pairs formed pairwise by a preset gene, a preset amino acid and a preset nucleotide, and the entity relationships corresponding to these entity pairs all indicate that a relationship exists, constructing a GPN variation associated with the preset variation document based on the preset gene, the preset amino acid and the preset nucleotide.

For example, when the preset entity pair of document 1 is gene 1 & disease 1, assuming that the entity relationship output by the target relationship extraction model is 1 or "related disease", the mutation types associated with the preset mutation document include the GD mutation "G1+D1". When the preset entity pairs of document 1 include gene 1 & amino acid 1, gene 1 & nucleotide 1, amino acid 1 & nucleotide 1, gene 1 & amino acid 2, and amino acid 2 & nucleotide 1, assuming that the entity relationships output by the target relationship extraction model are 1, 1, 1, 0, and 1, respectively, the mutation types associated with the preset mutation document include the GPN mutation "G1+P1+N1".
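A minimal sketch of this rule is given below, assuming the relationship extraction model's output has already been reduced to a set of related pairs. The function name `derive_variation_types`, the dictionary keys, and the short labels are hypothetical.

```python
from itertools import product


def derive_variation_types(entities, related):
    """Derive GD and GPN variations from pairwise entity relationships.

    entities: dict mapping an entity type to a list of entity names.
    related:  set of frozensets, one per entity pair judged to be related.
    """
    def has_relation(a, b):
        return frozenset((a, b)) in related

    # GD variation: the gene & disease pair must be related.
    gd = [
        (g, d)
        for g, d in product(entities["gene"], entities["disease"])
        if has_relation(g, d)
    ]
    # GPN variation: all three pairwise relationships must hold.
    gpn = [
        (g, p, n)
        for g, p, n in product(
            entities["gene"], entities["amino_acid"], entities["nucleotide"]
        )
        if has_relation(g, p) and has_relation(g, n) and has_relation(p, n)
    ]
    return gd, gpn


entities = {
    "gene": ["G1"],
    "disease": ["D1"],
    "amino_acid": ["P1", "P2"],
    "nucleotide": ["N1"],
}
# Illustrative model output: G1 & P2 is the only pair judged unrelated.
related = {
    frozenset(pair)
    for pair in [("G1", "D1"), ("G1", "P1"), ("G1", "N1"), ("P1", "N1"), ("P2", "N1")]
}
gd, gpn = derive_variation_types(entities, related)
print(gd, gpn)
```

Because gene G1 and amino acid P2 are not related in this illustrative input, no GPN variation containing P2 is constructed.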

S220, determining a preset weight value of an association relation between the preset mutation document and the mutation type according to the occurrence position of the mutation type in the preset mutation document for each mutation type.

Wherein, exemplarily, the preset weight value $w_{i,j}$ of the association relation between the $i$-th mutation type and the $j$-th preset variation document satisfies the formula:

$$w_{i,j} = p_{i,j} \times tf_{i,j} \times idf_i, \qquad tf_{i,j} = \frac{n_{i,j}}{N_j}, \qquad idf_i = \log\frac{|D|}{|D_i|}$$

Wherein, $p_{i,j}$ indicates the weight value corresponding to the occurrence position, in the $j$-th preset variation document, of the preset entities corresponding to the $i$-th mutation type. Exemplarily, the title position and the keyword position have a weight value of 3, the abstract position has a weight value of 2, and the chart position and the text position have a weight value of 1. Specifically, when the preset entities appear at a plurality of positions in the preset variation document, the position with the largest weight value is selected as the occurrence position of the mutation type.

Wherein, $tf_{i,j}$ indicates the frequency of occurrence, in the $j$-th preset variation document, of the preset entities corresponding to the $i$-th mutation type; $n_{i,j}$ indicates the number of occurrences, in the $j$-th preset variation document, of the preset entities corresponding to the $i$-th mutation type; $N_j$ indicates the total number of mutation types corresponding to the $j$-th preset variation document; $idf_i$ indicates the inverse document frequency corresponding to the $i$-th mutation type; $|D|$ represents the number of preset variation documents in the preset variation document set; and $|D_i|$ represents the number of preset variation documents that contain the $i$-th mutation type.

Wherein, specifically, the higher the frequency with which the preset entities corresponding to the $i$-th mutation type occur in the $j$-th preset variation document, and the lower the frequency with which they occur in the other preset variation documents, the more important the $i$-th mutation type is to the $j$-th preset variation document, and correspondingly, the larger the preset weight value of the association relation between the $i$-th mutation type and the $j$-th preset variation document.
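The weighting scheme can be sketched as follows. The position weights (3, 2, 1) are taken from the text; combining the position weight, the occurrence frequency and the inverse document frequency as a plain product with a natural logarithm is an assumption made for illustration, and the function name `preset_weight` and its parameters are hypothetical.

```python
import math

# Position weights as stated in the text.
POSITION_WEIGHT = {"title": 3, "keyword": 3, "abstract": 2, "chart": 1, "text": 1}


def preset_weight(positions, count_in_doc, total_in_doc, num_docs, docs_with_type):
    """Weight of one (variation type, document) association.

    Assumption: weight = position weight * frequency * inverse document frequency.
    """
    # When the entities appear at several positions, the largest weight is used.
    position = max(POSITION_WEIGHT[p] for p in positions)
    frequency = count_in_doc / total_in_doc
    inverse_doc_frequency = math.log(num_docs / docs_with_type)
    return position * frequency * inverse_doc_frequency


w = preset_weight(
    positions=["abstract", "text"],
    count_in_doc=2,
    total_in_doc=8,
    num_docs=100,
    docs_with_type=10,
)
print(round(w, 4))
```

With everything else equal, a variation type that also appears in the title receives a larger weight than one that appears only in the abstract and body text.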

S230, constructing a literature knowledge graph based on the association relation between each preset mutation literature and at least one mutation type and the preset weight value corresponding to each association relation.

Exemplarily, the literature knowledge graph is built with the Neo4j tool, which offers high stability and usability, supports data storage, retrieval and processing, and can hold hundreds of millions of nodes, relations and attributes, making it highly practical.

Specifically, for each preset variation document, adding the preset variation document into a document knowledge graph, judging whether the variation type exists in the document knowledge graph for each variation type corresponding to the preset variation document, if so, building an association relationship between the preset variation document and the variation type in the document knowledge graph, and adding a preset weight value corresponding to the association relationship; if not, adding the variation type into the literature knowledge graph, building an association relation between a preset variation literature and the variation type in the literature knowledge graph, and adding a preset weight value corresponding to the association relation.
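The insertion logic above can be sketched with a small in-memory structure. This stands in for the graph database only for illustration and does not use the Neo4j API; the class name `LiteratureGraph`, its methods, and the weight values are hypothetical.

```python
class LiteratureGraph:
    """Toy in-memory stand-in for the literature knowledge graph."""

    def __init__(self):
        self.documents = set()
        self.variation_types = set()
        self.edges = {}  # (document, variation type) -> preset weight value

    def add_document(self, document, weighted_types):
        """Add a document node and link it to each of its variation types."""
        self.documents.add(document)
        for variation_type, weight in weighted_types.items():
            # Create the variation type node only if it does not exist yet.
            if variation_type not in self.variation_types:
                self.variation_types.add(variation_type)
            # Build the association relation and attach its preset weight.
            self.edges[(document, variation_type)] = weight

    def documents_for(self, variation_type):
        """Return {document: preset weight} for one variation type."""
        return {
            doc: weight
            for (doc, vtype), weight in self.edges.items()
            if vtype == variation_type
        }


graph = LiteratureGraph()
graph.add_document("document 1", {"G1+D1": 0.51, "G1+P1+N1": 0.7})
graph.add_document("document 2", {"G1+D1": 0.9})
print(graph.documents_for("G1+D1"))
```

The variation type "G1+D1" is stored once and shared by both documents, which mirrors the "if it exists, only add the relation" branch of the description.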

On the basis of the above embodiment, optionally, the method further includes: in response to detecting an update instruction of the preset variation literature set, an update operation is performed on the literature knowledge graph based on update data corresponding to the preset variation literature set. Exemplary updating operations include, but are not limited to, building new literature nodes, building new variant type nodes, building new associations, deleting old literature nodes, deleting old variant type nodes, deleting old associations, and the like.

The advantage of this arrangement is that the literature knowledge graph is kept up to date, which ensures the timeliness and accuracy of the search results.

S240, in response to receiving the search data, constructing a search combination set based on at least one search entity in the search data.

S250, determining at least one reference variation document corresponding to the search combination and a search weight value corresponding to each reference variation document respectively according to each search combination in the search combination set based on a pre-constructed document knowledge graph.

S260, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

S270, determining a document retrieval result corresponding to the retrieval data based on the reference sorting results respectively corresponding to the retrieval combinations.

According to the technical scheme of this embodiment, for each preset variation document in the preset variation document set, at least one variation type associated with the preset variation document is obtained; for each variation type, the preset weight value of the association relation between the preset variation document and the variation type is determined based on the occurrence position of the variation type in the preset variation document; and the document knowledge graph is constructed based on the association relation between each preset variation document and at least one variation type and the preset weight value corresponding to each association relation. This solves the problem that a knowledge graph cannot be constructed directly from unstructured variation documents, makes the constructed document knowledge graph better match the search requirements of users, and improves document search efficiency.

On the basis of the foregoing embodiment, optionally, before constructing at least one preset entity pair based on each preset entity, the method further includes: adopting a preset alignment strategy, and respectively executing alignment operation on each preset entity based on each standard entity in a standard database to obtain at least one aligned preset entity; the preset alignment strategy comprises at least one of a preset alignment algorithm, a first mapping list and a vector similarity algorithm, wherein the first mapping list represents at least one preset entity corresponding to each standard entity in the standard database.

Exemplarily, the standard database includes, but is not limited to, authoritative databases such as HGNC, OMIM and ClinVar.

In an alternative embodiment, a preset alignment policy is adopted, and based on each standard entity in the standard database, an alignment operation is performed on each preset entity to obtain at least one aligned preset entity, which includes: aiming at each preset entity, adopting a preset alignment algorithm, and respectively executing alignment operation on each preset entity based on each standard entity in a standard database to obtain at least one aligned preset entity; and/or, based on the preset entity, inquiring the first mapping list to obtain an aligned preset entity; the first mapping list characterizes at least one preset entity corresponding to each standard entity in the standard database; and/or respectively inputting the preset entity and each standard entity in the standard database into a target semantic extraction model which is trained in advance to obtain a preset vector corresponding to the preset entity and a standard vector corresponding to each standard entity which are output; and determining the aligned preset entity based on the preset vector and each standard vector.

In one embodiment, the preset alignment algorithm may be a regular-expression matching algorithm, which is not limited herein.

In another embodiment, specifically, the first mapping list includes a plurality of standard entities and at least one preset entity corresponding to each standard entity. Illustratively, when the standard entity is the amino acid p.Glu615Gly, the preset entities corresponding to the standard entity include, but are not limited to, E615G, Glu615Gly, p.E615G, p.Glu615G, and the like. When the standard entity is the gene GJB2, the preset entities corresponding to the standard entity include, but are not limited to, gap junction protein beta 2, DFNB1, CX26, and the like.

In another embodiment, exemplarily, a cosine distance algorithm is adopted to obtain the vector similarity between the preset vector and each standard vector, and the standard entity corresponding to the standard vector with the highest vector similarity is used as the target standard entity corresponding to the preset entity, or the standard entity corresponding to a standard vector whose vector similarity exceeds the similarity threshold is used as the target standard entity corresponding to the preset entity.
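A minimal sketch of the vector-based selection follows. The two-dimensional vectors are toy values rather than model outputs, the entity name "OTHER" is a placeholder, and combining "highest similarity" with an optional threshold in a single function (`align_by_vector`) is an illustrative choice.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_by_vector(preset_vector, standard_vectors, threshold=None):
    """Return the most similar standard entity, or None if it fails the threshold."""
    best_name, best_similarity = None, -1.0
    for name, vector in standard_vectors.items():
        similarity = cosine_similarity(preset_vector, vector)
        if similarity > best_similarity:
            best_name, best_similarity = name, similarity
    if threshold is not None and best_similarity <= threshold:
        return None
    return best_name


# Toy vectors; in practice they would come from the semantic extraction model.
standard_vectors = {"GJB2": [1.0, 0.0], "OTHER": [0.0, 1.0]}
print(align_by_vector([0.9, 0.1], standard_vectors))
print(align_by_vector([0.5, 0.5], standard_vectors, threshold=0.9))
```

Passing a threshold lets the caller reject a weak best match instead of aligning the preset entity to an unrelated standard entity.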

In an alternative embodiment, the model architecture of the target semantic extraction model is a BioLinkBERT pre-training model. The BioLinkBERT pre-training model is trained on PubMed documents with citation links; compared with a general-domain pre-training language model, a pre-training language model oriented to the medical vertical domain can better capture the semantic features of medical entities.

On the basis of the above embodiment, optionally, the method further includes: performing fine-tuning training on the target semantic extraction model by using the contrastive learning method SimCSE to obtain a fine-tuned target semantic extraction model. The advantage of this arrangement is that preset entities and standard entities with similar semantics are brought closer in the vector space, while preset entities and standard entities with different semantics are pushed farther apart, which further improves the accuracy of aligning the preset entities with the standard database.

In another embodiment, specifically, a preset alignment algorithm is first adopted to judge whether the preset entity matches a standard entity in the standard database; if not, it is judged whether a standard entity aligned with the preset entity can be queried based on the first mapping list; if not, the preset entity and each standard entity in the standard database are respectively input into the target semantic extraction model which is trained in advance, a preset vector corresponding to the preset entity and a standard vector corresponding to each standard entity are obtained, and the target standard entity aligned with the preset entity is determined based on the preset vector and each standard vector.

Fig. 4 is a flowchart of a specific example of a method for identifying and aligning preset entities according to an embodiment of the present invention. Specifically, each preset variation document in the preset variation document set is input into a target entity recognition model which is trained in advance, and an output preset entity set is obtained. For each preset entity in the preset entity set, it is determined whether a standard entity aligned with the preset entity is matched by a regular expression; if so, the matched standard entity is taken as the aligned preset entity; if not, it is determined whether a standard entity aligned with the preset entity can be queried based on the first mapping list; if so, the queried standard entity is taken as the aligned preset entity; if not, the preset entity and each standard entity in the standard database are respectively input into the pre-trained target semantic extraction model, the vector similarity between the output preset vector and each standard vector is calculated, and the standard entity corresponding to the standard vector whose vector similarity exceeds the similarity threshold is taken as the target standard entity corresponding to the preset entity.
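The three-stage fallback of Fig. 4 can be sketched as below. The regular expression used here (a case-insensitive full match against each standard name) and the `vector_align` callback are illustrative assumptions; the mapping entries are the examples given for the first mapping list.

```python
import re

STANDARD_ENTITIES = ["GJB2", "p.Glu615Gly"]
FIRST_MAPPING_LIST = {
    "GJB2": ["gap junction protein beta 2", "DFNB1", "CX26"],
    "p.Glu615Gly": ["E615G", "Glu615Gly", "p.E615G", "p.Glu615G"],
}
# Reverse index: lower-cased alias -> standard entity.
ALIAS_TO_STANDARD = {
    alias.lower(): standard
    for standard, aliases in FIRST_MAPPING_LIST.items()
    for alias in aliases
}


def align_entity(preset_entity, vector_align=None):
    """Align a preset entity: regular expression, then mapping list, then vectors."""
    # Stage 1: regular-expression match against the standard entities.
    for standard in STANDARD_ENTITIES:
        pattern = r"\s*" + re.escape(standard) + r"\s*"
        if re.fullmatch(pattern, preset_entity, flags=re.IGNORECASE):
            return standard
    # Stage 2: look the entity up in the first mapping list.
    mapped = ALIAS_TO_STANDARD.get(preset_entity.strip().lower())
    if mapped is not None:
        return mapped
    # Stage 3: fall back to vector similarity (hypothetical callback).
    return vector_align(preset_entity) if vector_align is not None else None


print(align_entity(" gjb2 "))
print(align_entity("CX26"))
print(align_entity("connexin 26", vector_align=lambda entity: "GJB2"))
```

Ordering the stages from cheapest to most expensive means the semantic model is only invoked for entities that neither the pattern nor the mapping list can resolve.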

The same entity expresses the same semantics in different variant documents, but its expression forms vary. Performing the alignment operation on the preset entities therefore improves the normalization of the subsequently constructed document knowledge graph, and improves the comprehensiveness and accuracy of the subsequent retrieval results.

Fig. 5 is a flowchart of another method for searching for a variation document according to an embodiment of the present invention, where the method for determining the reference variation document and the search weight value in the above embodiment is further refined. As shown in fig. 5, the method includes:

S310, in response to receiving the search data, constructing a search combination set based on at least one search entity in the search data.

On the basis of the above embodiment, optionally, before constructing the search combination set based on at least one search entity in the search data, the method further includes: and aiming at each retrieval entity in the retrieval data, executing an alignment operation on the retrieval entity to obtain an aligned retrieval entity.

In this embodiment, the search data includes at least a search gene, and correspondingly, constructing a search combination set based on at least one search entity in the search data includes: adding the retrieval genes in the retrieval data as a first retrieval combination into a retrieval combination set; in the case where the search data further includes a search amino acid, adding the search gene and the search amino acid as a second search combination to the search combination set; in the case where the search data further includes a search nucleotide, the search gene and the search nucleotide are added as a third search combination to the search combination set.

Fig. 6 is a flowchart of a method for determining a search combination set according to an embodiment of the present invention. Specifically, the obtained search data includes at least a search gene, and the search gene is used as a first search combination. Whether the search data further includes a search amino acid is judged; if yes, the search gene and the search amino acid are used as a second search combination, and the step of judging whether the search data further includes a search nucleotide is then executed; if not, the step of judging whether the search data further includes a search nucleotide is executed directly. If the search data further includes a search nucleotide, the search gene and the search nucleotide are used as a third search combination; if the search data does not include a search nucleotide, the process is ended.

For example, assuming that the search data includes gene 1, disease 1, amino acid 1, and nucleotide 2, the search combination set contains [gene 1], [gene 1, amino acid 1], and [gene 1, nucleotide 2].
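The flow of Fig. 6 can be sketched as follows; the function name `build_search_combinations` and the combination labels are hypothetical.

```python
def build_search_combinations(gene, amino_acid=None, nucleotide=None):
    """Build the search combination set from the entities in the search data."""
    # The search gene is always present and forms the first combination.
    combinations = [("first", (gene,))]
    if amino_acid is not None:
        combinations.append(("second", (gene, amino_acid)))
    if nucleotide is not None:
        combinations.append(("third", (gene, nucleotide)))
    return combinations


combos = build_search_combinations(
    "gene 1", amino_acid="amino acid 1", nucleotide="nucleotide 2"
)
for label, entities in combos:
    print(label, entities)
```

A disease in the search data does not produce a combination of its own in this flow, which is why the function takes no disease argument.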

S320, for each search combination in the search combination set, a search variation set is determined based on the search combination.

In this embodiment, the search variation set includes at least one search variation type, and the search variation set is a search GD variation set or a search GPN variation set. Specifically, when the search mutation set is a search GD mutation set, the search mutation type is a search GD mutation, and when the search mutation set is a search GPN mutation set, the search mutation type is a search GPN mutation. Wherein, the GD variation characterizes a combination of a predetermined gene and a predetermined disease, and the GPN variation characterizes a combination of a predetermined gene, a predetermined amino acid, and a predetermined nucleotide.

In an alternative embodiment, determining a search variation set based on a search combination includes: when the search combination set comprises a first search combination, judging whether at least one query disease which has a relationship with the search gene in the first search combination exists in the document knowledge graph; if yes, determining a search GD variation set based on the search gene and the query diseases; if not, determining a search GPN variation set based on the search gene; wherein the preset entities corresponding to each first GPN variation in the search GPN variation set include the search gene.

Specifically, the literature knowledge graph is queried based on the search gene to obtain at least one query disease. Taking fig. 2 as an example, assuming that the search gene is gene 1, the query diseases include disease 1, disease 2, and disease 3, and correspondingly, the search GD variation set includes G1+D1, G1+D2, and G1+D3. Assuming that the search gene is gene 4, no query disease related to gene 4 exists in the literature knowledge graph, and correspondingly, the search GPN variation set contains G4+P2+N3.

In another alternative embodiment, determining a search variation set based on a search combination includes: when the search combination set comprises a second search combination, determining a search GPN variation set based on the second search combination, wherein the preset entities corresponding to each second GPN variation in the search GPN variation set include the search gene and the search amino acid; when the search combination set comprises a third search combination, determining a search GPN variation set based on the third search combination, wherein the preset entities corresponding to each third GPN variation in the search GPN variation set include the search gene and the search nucleotide.

Taking fig. 2 as an example, assuming that the second search combination includes gene 1 and amino acid 1, the search GPN variation set includes G1+P1+N1 and G1+P1+N2; assuming that the third search combination includes gene 1 and nucleotide 3, the search GPN variation set includes G1+P2+N3.

Fig. 7 is a flowchart of a method for determining a search variation set according to an embodiment of the present invention. Specifically, each search combination is judged respectively. When the search combination set includes the first search combination ([Gm]), the literature knowledge graph is queried to judge whether a query disease Dx corresponding to Gm exists; if so, the search GD variation Gm+Dx determined based on the search gene Gm and the query disease Dx is added to the search GD variation set; if not, the search GPN variation Gm+Px+Ny determined based on the search gene Gm is added to the search GPN variation set.

When the search combination set contains the second search combination ([ GmPn ]), the search GPN variation Gm+Pn+Nx determined based on the search gene Gm and the search amino acid Pn is added to the search GPN variation set, and when the search combination set contains the third search combination ([ GmNn ]), the search GPN variation Gm+Px+Nn determined based on the search gene Gm and the search nucleotide Nn is added to the search GPN variation set.

Wherein, 'm' and 'n' denote specific entities in the search data, and 'x' and 'y' denote preset entities queried from the literature knowledge graph; these entities are not limited herein.
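The branching of Fig. 7 can be sketched as follows. The two lists stand in for the variation types stored in the literature knowledge graph and reuse the variations quoted in the fig. 2 examples; the function name `search_variation_set` is hypothetical.

```python
# Toy contents of the literature knowledge graph (taken from the text's examples).
GD_VARIATIONS = [("G1", "D1"), ("G1", "D2"), ("G1", "D3")]
GPN_VARIATIONS = [
    ("G1", "P1", "N1"),
    ("G1", "P1", "N2"),
    ("G1", "P2", "N3"),
    ("G4", "P2", "N3"),
]


def search_variation_set(gene, amino_acid=None, nucleotide=None):
    """Return (kind, variations) for a single search combination."""
    if amino_acid is None and nucleotide is None:
        # First combination [Gm]: prefer GD variations, else fall back to GPN.
        gd = [v for v in GD_VARIATIONS if v[0] == gene]
        if gd:
            return "GD", gd
        return "GPN", [v for v in GPN_VARIATIONS if v[0] == gene]
    if amino_acid is not None:
        # Second combination [GmPn]: any nucleotide Nx is accepted.
        return "GPN", [
            v for v in GPN_VARIATIONS if v[0] == gene and v[1] == amino_acid
        ]
    # Third combination [GmNn]: any amino acid Px is accepted.
    return "GPN", [v for v in GPN_VARIATIONS if v[0] == gene and v[2] == nucleotide]


print(search_variation_set("G1"))
print(search_variation_set("G4"))
print(search_variation_set("G1", amino_acid="P1"))
print(search_variation_set("G1", nucleotide="N3"))
```

Each call handles one search combination, so a combination carries either an amino acid or a nucleotide in addition to the gene, never both.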

S330, determining at least one reference variation document and a retrieval weight value corresponding to each reference variation document based on the document knowledge graph and the retrieval variation set.

In an alternative embodiment, determining at least one reference variation document and a search weight value corresponding to each reference variation document based on the document knowledge graph and the search variation set includes: aiming at each retrieval variation type in the retrieval variation set, taking the association relation corresponding to the retrieval variation type in the document knowledge graph as a matching association relation, and taking the preset variation document corresponding to each matching association relation as a reference variation document; and acquiring at least one matching association relation corresponding to each reference variation document according to each reference variation document, and determining a retrieval weight value corresponding to the reference variation document based on preset weight values corresponding to the matching association relations in the document knowledge graph.

In one embodiment, when the number of matching association relationships corresponding to the reference mutation document is one, a preset weight value corresponding to the matching association relationship is used as a retrieval weight value corresponding to the reference mutation document.

In another embodiment, when the number of the matching association relationships corresponding to the reference mutation document is at least two, the statistical value of the preset weight value corresponding to each matching association relationship is used as the retrieval weight value corresponding to the reference mutation document. Wherein, the statistical value is exemplified as a maximum value, a minimum value, a median value or a mean value, etc., and the statistical value is not limited herein.

Taking fig. 2 as an example, assuming that the first search combination is gene 1 and disease 1, the search GD variation set includes G1+D1, the reference variation documents include document 1 and document 2, and the search weight values corresponding to document 1 and document 2 are 0.51 and 0.9, respectively. Assuming that the second search combination is gene 1 and amino acid 1, the search GPN variation set includes G1+P1+N1 and G1+P1+N2, the reference variation documents include document 1, document 2 and document 3, and the search weight values corresponding to document 1, document 2 and document 3 are 0.7, 0.46 and 0.5, respectively.

Assuming that the search variation set includes G1+P2+N3 and G4+P2+N3 and that the statistical value is the mean value, the reference variation documents include document 3; since the preset weight values corresponding to the two matching association relations of document 3 are 0.8 and 0.91, respectively, the search weight value corresponding to document 3 is (0.8 + 0.91) / 2 = 0.855.
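The aggregation and ranking steps can be sketched as follows. The document names and weight values are hypothetical, and the function names `retrieval_weights` and `rank_documents` are illustrative.

```python
from statistics import mean, median

# Statistics named in the text as examples.
STATISTICS = {"max": max, "min": min, "median": median, "mean": mean}


def retrieval_weights(matches, statistic="mean"):
    """Aggregate preset weights per document.

    matches: list of (document, preset weight), one per matching association.
    """
    per_document = {}
    for document, weight in matches:
        per_document.setdefault(document, []).append(weight)
    aggregate = STATISTICS[statistic]
    # A document with a single match keeps that match's weight unchanged.
    return {doc: aggregate(weights) for doc, weights in per_document.items()}


def rank_documents(weights):
    """Sort documents by retrieval weight, highest first."""
    return sorted(weights, key=weights.get, reverse=True)


matches = [("document X", 0.4), ("document X", 0.6), ("document Y", 0.7)]
weights = retrieval_weights(matches, statistic="mean")
print(weights)
print(rank_documents(weights))
```

Switching `statistic` to "max" would rank a document by its single strongest association instead of its average.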

S340, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

S350, determining a document retrieval result corresponding to the retrieval data based on the reference sorting results respectively corresponding to the retrieval combinations.

In an optional embodiment, the search combination set includes at least two search combinations, and accordingly, before determining the document search result corresponding to the search data based on the reference ranking results respectively corresponding to the search combinations, the method further includes: in the case that at least one repeated variation document exists across the reference ranking results, for each repeated variation document, deleting the repeated variation document from the reference ranking result corresponding to the search combination with lower priority to obtain a screened reference ranking result; or deleting the repeated variation document with the smaller search weight value from its corresponding reference ranking result to obtain a screened reference ranking result.

In an alternative embodiment, the priorities corresponding to the first search combination, the third search combination and the second search combination respectively increase sequentially.

For example, if the screened reference ranking results corresponding to the first search combination, the second search combination, and the third search combination are [document 1, document 2], [document 3, document 4, document 5], and [document 6], respectively, the document search result is [document 3, document 4, document 5, document 6, document 1, document 2].
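A minimal sketch of the priority-based screening and merging follows. It implements only the first option (dropping a repeated document from the lower-priority result); the numeric priorities, the function name `merge_rankings`, and the input lists are hypothetical.

```python
# Higher number = higher priority: second > third > first.
PRIORITY = {"second": 3, "third": 2, "first": 1}


def merge_rankings(rankings):
    """Merge per-combination rankings, keeping each document once.

    rankings: dict mapping a combination label to its ordered document list.
    """
    merged, seen = [], set()
    for label in sorted(rankings, key=PRIORITY.get, reverse=True):
        for document in rankings[label]:
            # A repeated document survives only in the higher-priority result.
            if document not in seen:
                seen.add(document)
                merged.append(document)
    return merged


rankings = {
    "first": ["document 1", "document 2", "document 3"],
    "second": ["document 3", "document 4"],
    "third": ["document 2", "document 6"],
}
print(merge_rankings(rankings))
```

The order inside each reference ranking result is preserved; only the order in which the results are concatenated depends on priority.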

According to the technical scheme of the embodiment, the search genes in the search data are used as the first search combination, the search genes and the search amino acids are used as the second search combination when the search data further comprise the search amino acids, the search genes and the search nucleotides are used as the third search combination when the search data further comprise the search nucleotides, the search variation set is determined based on the search combination, the search weight value corresponding to at least one reference variation document and each reference variation document is determined based on the document knowledge graph and the search variation set, the problem that the number of variation documents in the document search result is too small is solved, the matching degree of the document search result and the user search requirement is ensured, and meanwhile, as many related variation documents are used as the document search result as possible, so that the comprehensiveness and the search efficiency of document search are further improved.

Fig. 8 is a flowchart of another method for searching variant documents according to an embodiment of the present invention, where the reference ranking result in the above embodiment is further refined. As shown in fig. 8, the method includes:

S410, in response to receiving the search data, constructing a search combination set based on at least one search entity in the search data, and sequentially acquiring each search combination in the search combination set.

In this embodiment, the search data includes at least a search gene and a search transcript, and accordingly, constructing a search combination set based on at least one search entity in the search data further includes: in the case where the search data further includes a search amino acid, adding the search gene, the search amino acid, and the search transcript as a fourth search combination to the search combination set; in the case where the search data does not include the search amino acid but further includes the search nucleotide, adding the search gene, the search nucleotide, and the search transcript as a fifth search combination to the search combination set.

Fig. 9 is a flowchart of another method for determining a search combination set according to an embodiment of the present invention. Specifically, the obtained search data includes at least a search gene, and the search gene is used as a first search combination. Whether the search data further includes a search amino acid is judged, and if yes, the search gene and the search amino acid are used as a second search combination. On the one hand, it is then judged whether the search data further includes a search transcript, and if so, the search gene, the search amino acid, and the search transcript are used as a fourth search combination. On the other hand, it is further judged whether the search data further includes a search nucleotide; if the search nucleotide is included, the search gene and the search nucleotide are used as a third search combination, and if the search nucleotide is not included, the process is ended.

If the search data does not include the search amino acid, continuing to judge whether the search data further includes the search nucleotide, if not, ending, if yes, taking the search gene and the search nucleotide as a third search combination, and continuing to judge whether the search data further includes the search transcript, and if further includes the search transcript, taking the search gene, the search nucleotide and the search transcript as a fifth search combination. If the retrieved transcript is not included, then it ends.

For example, assuming that the search data includes gene 1, amino acid 1, nucleotide 2, and transcript 1, the search combination set contains [gene 1], [gene 1, amino acid 1], [gene 1, amino acid 1, transcript 1], and [gene 1, nucleotide 2]. Assuming that the search data includes gene 1, nucleotide 2, and transcript 1, the search combination set contains [gene 1], [gene 1, nucleotide 2, transcript 1], and [gene 1, nucleotide 2].
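The extended flow of Fig. 9 can be sketched as follows; the function name `build_search_combinations` and the combination labels used as dictionary keys are hypothetical.

```python
def build_search_combinations(gene, amino_acid=None, nucleotide=None, transcript=None):
    """Build the search combination set, including transcript-based combinations."""
    combinations = {"first": (gene,)}
    if amino_acid is not None:
        combinations["second"] = (gene, amino_acid)
        if transcript is not None:
            # Fourth combination: gene + amino acid + transcript.
            combinations["fourth"] = (gene, amino_acid, transcript)
    if nucleotide is not None:
        combinations["third"] = (gene, nucleotide)
        if amino_acid is None and transcript is not None:
            # Fifth combination: only built when no amino acid was supplied.
            combinations["fifth"] = (gene, nucleotide, transcript)
    return combinations


combos = build_search_combinations(
    "gene 1", nucleotide="nucleotide 2", transcript="transcript 1"
)
for label, entities in combos.items():
    print(label, entities)
```

The fourth and fifth combinations are mutually exclusive here: the amino acid, when present, takes precedence over the nucleotide for the transcript-based combination.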

S420, judging whether the search combination is a fourth search combination or a fifth search combination, if so, executing S430, and if not, executing S450.

S430, determining a search standard variation set based on the search combination.

In this embodiment, a search standard variation in the search standard variation set characterizes a combination of a preset gene, a preset amino acid, a preset transcript (sequence), and a preset nucleotide. In the standard database map, the variation name of a standard variation is composed of a preset gene, a preset amino acid, a preset transcript and a preset nucleotide, and a specific number is also included. For example, the variation name of a standard variation is "NM_004004.6(GJB2):c.670A>C (p.Lys224Gln)", numbered 44765.

Specifically, when the search combination is the fourth search combination, each preset entity corresponding to each search standard variation in the search standard variation set includes a search gene, a search amino acid and a search transcript, and when the search combination is the fifth search combination, each preset entity corresponding to each search standard variation in the search standard variation set includes a search gene, a search nucleotide and a search transcript.

In an alternative embodiment, after determining the search standard variation set based on the search combination, the method further comprises: when the search combination is a fifth search combination and the search standard variation set is an empty set, acquiring the target HGVS variation corresponding to the search nucleotide and the search transcript in the fifth search combination; based on the second mapping list, adding the standard variation corresponding to the target HGVS variation to the search standard variation set as a search standard variation; wherein the second mapping list characterizes the mapping relationship between at least one preset HGVS variation and the standard variations, and a preset HGVS variation characterizes a combination of a preset nucleotide and a preset transcript.

Specifically, the standard database provides a second mapping list, and in the case that the search standard variation is not found by the fifth search combination, the target HGVS variation can be uniquely located by the search nucleotide and the search transcript in the fifth search combination, and the search standard variation corresponding to the fifth search combination is found by the second mapping list.
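The fallback through the second mapping list can be sketched as follows. The record reuses the standard variation quoted in the description; the dictionary layout, the function name `search_standard_variations`, and the "transcript:nucleotide" key format are illustrative assumptions.

```python
def search_standard_variations(
    gene, nucleotide, transcript, standard_variations, second_mapping_list
):
    """Look up standard variations for a fifth search combination."""
    # Direct lookup by gene, nucleotide and transcript.
    hits = [
        v
        for v in standard_variations
        if v["gene"] == gene
        and v["nucleotide"] == nucleotide
        and v["transcript"] == transcript
    ]
    if hits:
        return hits
    # Empty result: locate the target HGVS variation from transcript + nucleotide.
    target_hgvs = f"{transcript}:{nucleotide}"
    mapped = second_mapping_list.get(target_hgvs)
    return [mapped] if mapped is not None else []


record = {
    "name": "NM_004004.6(GJB2):c.670A>C (p.Lys224Gln)",
    "number": 44765,
    "gene": "GJB2",
    "transcript": "NM_004004.6",
    "nucleotide": "c.670A>C",
}
second_mapping_list = {"NM_004004.6:c.670A>C": record}

# The gene alias "CX26" misses the direct lookup but is recovered via the mapping.
result = search_standard_variations(
    "CX26", "c.670A>C", "NM_004004.6", [record], second_mapping_list
)
print(result[0]["number"])
```

Because the HGVS key is built only from the transcript and the nucleotide, the fallback still succeeds when the gene was entered under a name the direct lookup does not recognize.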

The arrangement has the advantages that the problem that the variation of the search standard cannot be found through the fifth search combination is solved, and the detection rate of a subsequent standard database search engine is further ensured.
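The fallback described above can be illustrated with a minimal sketch. It assumes the second mapping list is available as an in-memory dictionary keyed by a "transcript:nucleotide" string; the names `SECOND_MAPPING_LIST`, `hgvs_key` and `apply_hgvs_fallback` are hypothetical and do not come from the embodiment, and the single mapping entry merely reuses the GJB2 example for illustration.

```python
from typing import Dict, List

# Hypothetical second mapping list: preset HGVS variation -> standard variation name.
# The entry below is illustrative only and reuses the GJB2 example from the description.
SECOND_MAPPING_LIST: Dict[str, str] = {
    "NM_004004.6:c.670A>C": "NM_004004.6(GJB2):c.670A>C (p.Lys224Gln)",
}


def hgvs_key(transcript: str, nucleotide: str) -> str:
    """Combine a search transcript and a search nucleotide into an HGVS-style key."""
    return f"{transcript}:{nucleotide}"


def apply_hgvs_fallback(
    standard_variations: List[str],
    transcript: str,
    nucleotide: str,
    mapping: Dict[str, str],
) -> List[str]:
    """Return the search standard variation set, filled from the mapping list if it is empty."""
    if standard_variations:
        # The fifth search combination already produced results; nothing to add.
        return list(standard_variations)
    target = mapping.get(hgvs_key(transcript, nucleotide))
    return [target] if target is not None else []
```

In this sketch the lookup is attempted only when the set is empty, which mirrors the condition stated for the fifth search combination; an unknown HGVS variation simply leaves the set empty.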

S440, determining a reference ordering result corresponding to the retrieval combination based on the retrieval standard variation set by adopting a standard database search engine, and executing S470.

The standard database search engine is specifically used to execute search operations on the standard database map provided by the standard database and to output a reference sorting result. By way of example, the standard database search engine may be the search engine provided for the ClinVar database.

S450, determining at least one reference variation document corresponding to the search combination and search weight values corresponding to the reference variation documents respectively based on a pre-constructed document knowledge graph.

S460, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

S470, determining a document retrieval result corresponding to the retrieval data based on the reference sorting results respectively corresponding to the retrieval combinations.
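Steps S450 and S460 can be sketched as follows. The sketch assumes the document knowledge graph is stored as a dictionary from a variation type to the documents associated with it and their preset weight values, and it assumes that the retrieval weight value of a document is the sum of the preset weight values of the matched associations. That aggregation rule, like the name `rank_reference_documents`, is an illustrative assumption rather than the rule fixed by the embodiment.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Assumed layout: variation type -> {document id: preset weight value}.
KnowledgeGraph = Dict[Hashable, Dict[str, float]]


def rank_reference_documents(
    graph: KnowledgeGraph,
    search_variations: Iterable[Hashable],
) -> List[Tuple[str, float]]:
    """Collect reference documents for the search variations and sort them by weight."""
    retrieval_weights: Dict[str, float] = defaultdict(float)
    for variation in search_variations:
        for doc_id, preset_weight in graph.get(variation, {}).items():
            # Assumption: retrieval weight = sum of matched preset weights.
            retrieval_weights[doc_id] += preset_weight
    # Descending by retrieval weight; ties broken by document id for determinism.
    return sorted(retrieval_weights.items(), key=lambda item: (-item[1], item[0]))
```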

In an alternative embodiment, the fourth search combination and the fifth search combination have the highest priority, followed in descending order of priority by the second search combination, the third search combination and the first search combination.
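The priority order stated above can be written down as a small lookup table. This is only a sketch: the labels "first" to "fifth" and the function `order_by_priority` are hypothetical, and how the priority is consumed when the document retrieval result is assembled is not specified here.

```python
from typing import Dict, Iterable, List

# Smaller number = higher priority. The fourth and fifth combinations share the top rank.
COMBINATION_PRIORITY: Dict[str, int] = {
    "fourth": 0,
    "fifth": 0,
    "second": 1,
    "third": 2,
    "first": 3,
}


def order_by_priority(combination_names: Iterable[str]) -> List[str]:
    """Sort search combination labels from highest to lowest priority (stable sort)."""
    return sorted(combination_names, key=lambda name: COMBINATION_PRIORITY[name])
```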

Fig. 10 is a flowchart of a specific example of a method for retrieving a variant document according to an embodiment of the present invention. Specifically, entity recognition is performed on each preset variation document in a preset variation document set to obtain preset entities, and relation extraction is performed on the preset entity pairs constructed from those preset entities to obtain the preset entity pairs that have a relationship. At least one variation type associated with the preset variation document is determined based on the entity relationship corresponding to each preset entity pair. For each variation type, a preset weight value of the association relation between the preset variation document and the variation type is determined based on the position at which the variation type occurs in the preset variation document. The document knowledge graph is then constructed based on the association relations between each preset variation document and at least one variation type and on the preset weight value corresponding to each association relation.
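The position-based weighting in the graph construction can be sketched as below. The position labels ("title", "abstract", "body"), the numeric weights, and the rule of keeping the largest weight when a variation type occurs at several positions are all assumptions made for illustration; the embodiment only states that the preset weight value depends on the occurrence position.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical mapping from occurrence position to preset weight value.
POSITION_WEIGHTS: Dict[str, float] = {"title": 3.0, "abstract": 2.0, "body": 1.0}

# (variation type, position at which it occurs in the document)
Occurrence = Tuple[Hashable, str]


def build_knowledge_graph(
    documents: Dict[str, List[Occurrence]],
) -> Dict[Hashable, Dict[str, float]]:
    """Build variation type -> {document id: preset weight value} associations."""
    graph: Dict[Hashable, Dict[str, float]] = {}
    for doc_id, occurrences in documents.items():
        for variation, position in occurrences:
            weight = POSITION_WEIGHTS[position]
            edges = graph.setdefault(variation, {})
            # Assumption: keep the highest weight if the variation occurs more than once.
            edges[doc_id] = max(edges.get(doc_id, 0.0), weight)
    return graph
```

Keying the graph by variation type keeps lookups by search variation cheap at query time, which is one possible design choice among several.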

Based on the search data (NM_020779.4(WDR35):c.1844A>G (p.Glu645Gly)), a search combination set is determined, and for each search combination in the search combination set, the standard database map and/or the pre-constructed document knowledge graph is searched to obtain the reference sorting result corresponding to that search combination. In fig. 10, the reference sorting results include sorting result 1 and sorting result 2. The first two variant documents in each of sorting result 1 and sorting result 2, together with their matching fragments, are input into the evidence classification model to obtain the matching probability corresponding to each target variant document, and the 4 documents are sorted in descending order of matching probability to obtain the document retrieval result corresponding to the retrieval combination.
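The final re-ranking step of this example can be sketched as follows. The pre-trained evidence classification model is represented by a caller-supplied scoring function `score(fragment, combination)`, which is a stand-in and not the model of the embodiment; keeping the highest probability when the same document appears in more than one sorting result is likewise an assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (document id, matching fragment for the search combination)
Candidate = Tuple[str, str]


def rerank_with_evidence_model(
    rankings: Sequence[Sequence[Candidate]],
    combination: str,
    score: Callable[[str, str], float],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Take the top_k documents of each ranking and sort them by matching probability."""
    best: Dict[str, float] = {}
    for ranking in rankings:
        for doc_id, fragment in ranking[:top_k]:
            probability = score(fragment, combination)
            # Assumption: a document found in several rankings keeps its best probability.
            best[doc_id] = max(best.get(doc_id, 0.0), probability)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

With two sorting results and `top_k=2`, at most four documents are scored, matching the 4 documents of the example.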

According to the technical solution of this embodiment, when the search combination is the fourth search combination or the fifth search combination, the search standard variation set is determined based on the search combination, a standard database search engine determines the reference sorting result corresponding to the search combination based on the search standard variation set, and the document retrieval result corresponding to the search combination is determined based on the reference sorting results. This solves the problem of too few variant documents in the document retrieval result while ensuring that the result matches the user's search requirement. Because the variant documents found in the standard database map are added to the document retrieval result, the comprehensiveness and efficiency of document retrieval are further improved.

Fig. 11 is a schematic structural diagram of a variant document searching device according to an embodiment of the present invention. As shown in fig. 11, the apparatus includes: the search combination set construction module 510, the reference variation document determination module 520, the reference ranking result determination module 530, and the document search result determination module 540.

Wherein, the search combination set construction module 510 is configured to construct a search combination set based on at least one search entity in the search data in response to receiving the search data;

A reference mutation document determining module 520, configured to determine, for each search combination in the search combination set, at least one reference mutation document corresponding to the search combination and a search weight value corresponding to each reference mutation document, respectively, based on a pre-constructed document knowledge graph;

the reference ranking result determining module 530 is configured to rank each reference variant document based on each search weight value to obtain a reference ranking result corresponding to the search combination;

a document retrieval result determining module 540, configured to determine a document retrieval result corresponding to the retrieval data based on the reference ranking results respectively corresponding to the retrieval combinations;

the literature knowledge graph comprises at least one association relation corresponding to at least one preset variation literature and at least one variation type respectively and a preset weight value corresponding to each association relation respectively, and the variation type represents a combination form of at least one preset entity.

According to the technical solution of this embodiment, a document knowledge graph is constructed in advance that includes the association relations between at least one preset variation document and at least one variation type, together with the preset weight value corresponding to each association relation. A search combination set is constructed based on at least one search entity in the received search data. For each search combination in the search combination set, at least one reference variation document corresponding to the search combination and the retrieval weight value corresponding to each reference variation document are determined based on the pre-constructed document knowledge graph, and the reference variation documents are sorted based on the retrieval weight values to obtain the reference sorting result corresponding to the search combination. The document retrieval result corresponding to the search data is then determined based on the reference sorting results. This solves the problem that the results given by a conventional variation search engine are of limited reference value: variant documents that better match the user's search requirement are ranked first, and the efficiency of document retrieval is improved.

On the basis of the above embodiment, optionally, the apparatus further includes:

the document knowledge graph construction module is used for acquiring at least one variation type associated with each preset variation document in the preset variation document set;

determining a preset weight value of an association relation between a preset mutation document and a mutation type according to the occurrence position of the mutation type in the preset mutation document for each mutation type;

and constructing a literature knowledge graph based on the association relation between each preset variation literature and at least one variation type and the preset weight value corresponding to each association relation.

On the basis of the above embodiment, optionally, the document knowledge graph construction module includes:

the preset entity pair construction unit is used for acquiring at least two preset entities corresponding to the preset variation literature and constructing at least one preset entity pair based on each preset entity;

the entity relation determining unit is used for inputting the preset entity pairs and the entity fragments corresponding to the preset entity pairs in the preset variation literature into a relation extraction model which is trained in advance for each preset entity pair to obtain an output entity relation;

The variation type determining unit is used for determining at least one variation type associated with the preset variation document based on the entity relation corresponding to each preset entity pair.

Wherein each preset entity pair comprises at least one entity pair formed by preset genes and preset diseases and at least one entity pair formed by preset genes, preset amino acids and preset nucleotides.

On the basis of the above embodiment, optionally, the mutation type includes GD mutation and/or GPN mutation, and the corresponding mutation type determining unit is specifically configured to:

when the preset entity pair is an entity pair formed by a preset gene and a preset disease, and the entity relationship of the preset entity pair indicates that a relationship exists, constructing a GD variation associated with the preset variation document based on the preset entity pair;

when the preset entity pairs include entity pairs formed by a preset gene, a preset amino acid and a preset nucleotide, and the entity relationships corresponding to these entity pairs all indicate that a relationship exists, constructing a GPN variation associated with the preset variation document based on the preset gene, the preset amino acid and the preset nucleotide.
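The two construction rules of the variation type determining unit can be illustrated with a short sketch. The output of the relation extraction model is represented here by a plain set of unordered entity pairs that were judged to be related; this set, the function name `determine_variation_types`, and the reading that a GPN variation requires all three pairwise relations (gene and amino acid, gene and nucleotide, amino acid and nucleotide) are assumptions for illustration.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple


def determine_variation_types(
    genes: Sequence[str],
    diseases: Sequence[str],
    amino_acids: Sequence[str],
    nucleotides: Sequence[str],
    related: Set[FrozenSet[str]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Return (GD variations, GPN variations) for one preset variation document."""

    def has_relation(a: str, b: str) -> bool:
        # Stand-in for the entity relation output by the relation extraction model.
        return frozenset((a, b)) in related

    gd = [(g, d) for g in genes for d in diseases if has_relation(g, d)]
    gpn = [
        (g, p, n)
        for g in genes
        for p in amino_acids
        for n in nucleotides
        # Assumption: every pair among gene, amino acid and nucleotide must be related.
        if has_relation(g, p) and has_relation(g, n) and has_relation(p, n)
    ]
    return gd, gpn
```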

the preset entity alignment module is used for respectively executing alignment operation on each preset entity based on each standard entity in the standard database by adopting a preset alignment strategy before constructing at least one preset entity pair based on each preset entity to obtain at least one aligned preset entity; the preset alignment strategy comprises at least one of a preset alignment algorithm, a first mapping list and a vector similarity algorithm, wherein the first mapping list represents at least one preset entity corresponding to each standard entity in the standard database.
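One way to combine the first mapping list with a vector similarity algorithm is sketched below. The mapping list is assumed to be a dictionary from each standard entity to its known preset entity names, and the similarity step uses character-bigram count vectors with cosine similarity and a threshold of 0.8. The vector representation, the threshold, and the function names are illustrative assumptions; the embodiment does not specify them.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def bigram_vector(text: str) -> Counter:
    """Represent a string as a bag of lowercase character bigrams."""
    normalized = text.lower()
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align_entity(
    entity: str,
    first_mapping_list: Dict[str, List[str]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Map a preset entity to a standard entity, or return None if no match is found."""
    # Step 1: exact lookup in the first mapping list.
    for standard, aliases in first_mapping_list.items():
        if entity == standard or entity in aliases:
            return standard
    # Step 2: fall back to vector similarity against the standard entity names.
    best, best_score = None, 0.0
    vector = bigram_vector(entity)
    for standard in first_mapping_list:
        similarity = cosine(vector, bigram_vector(standard))
        if similarity > best_score:
            best, best_score = standard, similarity
    return best if best_score >= threshold else None
```

Trying the mapping list first keeps the cheap, exact path ahead of the approximate one; a production system would more likely compare learned embeddings than character bigrams.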

On the basis of the above embodiment, optionally, the search data includes at least a search gene, and the corresponding search combination set construction module 510 is specifically configured to:

adding the retrieval genes in the retrieval data as a first retrieval combination into a retrieval combination set;

in the case where the search data further includes a search amino acid, adding the search gene and the search amino acid as a second search combination to the search combination set;

in the case where the search data further includes a search nucleotide, the search gene and the search nucleotide are added as a third search combination to the search combination set.
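The construction of the first, second and third search combinations can be sketched as follows. The labels and the tuple layout are hypothetical; the sketch only reflects the stated conditions, namely that a search gene is always present and that the amino acid and nucleotide are optional.

```python
from typing import List, Optional, Tuple

# (combination label, search entities in the combination)
Combination = Tuple[str, Tuple[str, ...]]


def build_search_combinations(
    gene: str,
    amino_acid: Optional[str] = None,
    nucleotide: Optional[str] = None,
) -> List[Combination]:
    """Build the search combination set from the entities found in the search data."""
    combinations: List[Combination] = [("first", (gene,))]
    if amino_acid is not None:
        combinations.append(("second", (gene, amino_acid)))
    if nucleotide is not None:
        combinations.append(("third", (gene, nucleotide)))
    return combinations
```

For search data containing a gene, an amino acid change and a nucleotide change, this yields all three combinations; for a gene alone it yields only the first.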

Based on the above embodiment, optionally, each mutation type in the literature knowledge graph includes GD mutation and/or GPN mutation, and the reference mutation literature determining module 520 includes:

a search variation set determination unit configured to determine a search variation set based on the search combination; the search variation set comprises at least one search variation type, and the search variation set is a search GD variation set or a search GPN variation set;

and the retrieval weight value determining unit is used for determining at least one reference variation document and the retrieval weight value corresponding to each reference variation document respectively based on the document knowledge graph and the retrieval variation set.

On the basis of the above embodiment, optionally, the search variation set determining unit includes:

a search GD variation set determination subunit configured to determine, when the search combination set includes the first search combination, whether or not there is at least one query disease in the document knowledge graph that has a relationship with the search genes in the first search combination;

if yes, determining a search GD variation set based on the search genes and the query diseases;

if not, determining a search GPN variation set based on the search gene; wherein the preset entities corresponding to each first GPN variation in the search GPN variation set include the search gene.

a search GPN variation set determination subunit configured to determine a search GPN variation set based on the second search combination when the search combination set includes the second search combination; wherein the preset entities corresponding to each second GPN variation in the search GPN variation set include the search gene and the search amino acid;

when the search combination set includes a third search combination, determining a search GPN variation set based on the third search combination; wherein the preset entities corresponding to each third GPN variation in the search GPN variation set include the search gene and the search nucleotide.
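The branching performed by the two subunits can be sketched as follows. The document knowledge graph is reduced here to two hypothetical lookups: `gene_diseases`, which lists the query diseases related to each gene, and `gpn_variations`, which lists the known (gene, amino acid, nucleotide) triples. These structures and the combination labels are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Gpn = Tuple[str, str, str]  # (gene, amino acid, nucleotide)


def determine_search_variation_set(
    combination: str,
    entities: Sequence[str],
    gene_diseases: Dict[str, List[str]],
    gpn_variations: Sequence[Gpn],
) -> Tuple[str, list]:
    """Return ("GD" or "GPN", search variation set) for one search combination."""
    gene = entities[0]
    if combination == "first":
        diseases = gene_diseases.get(gene, [])
        if diseases:
            # Query diseases related to the gene exist: search GD variation set.
            return "GD", [(gene, disease) for disease in diseases]
        # Otherwise fall back to every GPN variation that contains the gene.
        return "GPN", [v for v in gpn_variations if v[0] == gene]
    if combination == "second":
        amino_acid = entities[1]
        return "GPN", [v for v in gpn_variations if v[0] == gene and v[1] == amino_acid]
    if combination == "third":
        nucleotide = entities[1]
        return "GPN", [v for v in gpn_variations if v[0] == gene and v[2] == nucleotide]
    raise ValueError(f"unsupported search combination: {combination}")
```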

On the basis of the above embodiment, optionally, the search combination set includes a fourth search combination of a search gene, a search amino acid, and a search transcript, or the search combination set includes a fifth search combination of a search gene, a search nucleotide, and a search transcript, and accordingly, the apparatus further includes:

the search standard variation set determining module is used for determining a search standard variation set based on the search combination when the search combination is a fourth search combination or a fifth search combination;

determining a reference ordering result corresponding to the search combination based on the search standard variation set by adopting a standard database search engine;

wherein each search standard variation in the search standard variation set characterizes a combination of a preset gene, a preset amino acid, a preset transcript and a preset nucleotide.

the search standard variation adding module is used for acquiring target HGVS variation corresponding to search nucleotides and search transcripts in a fifth search combination under the condition that the search standard variation set is an empty set when the search combination is the fifth search combination after the search standard variation set is determined based on the search combination;

based on the second mapping list, adding the standard variation corresponding to the target HGVS variation as a search standard variation in the search standard variation set;

wherein the second mapping list characterizes a mapping relationship between at least one preset HGVS variation and standard variation, respectively, and the preset HGVS variation characterizes a combination of preset nucleotides and preset transcripts.

On the basis of the above embodiment, optionally, the document retrieval result determining module 540 is specifically configured to:

respectively acquiring target variation documents with preset sequencing quantity from each reference sequencing result, and acquiring matching fragments corresponding to the retrieval combination in each target variation document;

Inputting the matching segments corresponding to the target mutation documents and the search combination into a pre-trained evidence classification model aiming at each target mutation document to obtain the matching probability corresponding to the target mutation document;

and sorting all the target variant documents in a descending order based on all the matching probabilities to obtain document retrieval results corresponding to the retrieval combinations.

The device for searching the variant document provided by the embodiment of the invention can execute the method for searching the variant document provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 12, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The processor 11 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as the retrieval method of the variant document provided in the above embodiment.

In some embodiments, the method for retrieving the variant document provided in the above embodiments may be implemented as a computer program, which is tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps in the above-described retrieval method of the variant document may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the retrieval method of the variant document in any other suitable way (e.g. by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for searching a mutation document is characterized by comprising the following steps:

the literature knowledge graph comprises at least one association relation corresponding to at least one preset variation literature and at least one variation type respectively and preset weight values corresponding to the association relations respectively, wherein the variation type represents a combination form of at least one preset entity, the variation type comprises GD variation and/or GPN variation, the GD variation represents a combination form of a preset gene and a preset disease, and the GPN variation represents a combination form of a preset gene, a preset amino acid and a preset nucleotide;

the search data at least comprises a search gene, and the construction of a search combination set based on at least one search entity in the search data comprises the following steps:

in the case where the search data further includes a search amino acid, adding the search gene and the search amino acid as a second search combination to a search combination set;

In the case where the search data further includes a search nucleotide, adding the search gene and the search nucleotide as a third search combination to a search combination set;

the determining, based on the pre-constructed document knowledge graph, at least one reference mutation document corresponding to the search combination and a search weight value corresponding to each reference mutation document, includes:

determining a search variation set based on the search combination; the search mutation set comprises at least one search mutation type, and the search mutation set is a search GD mutation set or a search GPN mutation set;

and determining at least one reference variation document and a retrieval weight value corresponding to each reference variation document respectively based on the document knowledge graph and the retrieval variation set.

2. The method according to claim 1, wherein the method further comprises:

for each preset variation document in a preset variation document set, acquiring at least one variation type associated with the preset variation document;

determining a preset weight value of an association relation between the preset mutation document and the mutation type according to the occurrence position of the mutation type in the preset mutation document for each mutation type;

3. The method of claim 2, wherein the obtaining at least one variation type associated with the preset variation document comprises:

acquiring at least two preset entities corresponding to the preset variation document, and constructing at least one preset entity pair based on each preset entity;

inputting the preset entity pairs and entity fragments corresponding to the preset entity pairs in the preset variation literature into a pre-trained relation extraction model aiming at each preset entity pair to obtain an output entity relation;

determining at least one variation type associated with the preset variation document based on the entity relation corresponding to each preset entity pair;

4. The method of claim 3, wherein determining at least one mutation type associated with the preset mutation document based on the respective entity relationships of each of the preset entity pairs comprises:

when the preset entity pair is an entity pair formed by a preset gene and a preset disease, and the entity relationship of the preset entity pair indicates that a relationship exists, constructing a GD variation associated with the preset variation document based on the preset entity pair;

when the preset entity pairs include entity pairs formed by a preset gene, a preset amino acid and a preset nucleotide, and the entity relationships corresponding to these entity pairs all indicate that a relationship exists, constructing a GPN variation associated with the preset variation document based on the preset gene, the preset amino acid and the preset nucleotide.

5. A method according to claim 3, wherein prior to constructing at least one preset entity pair based on each of the preset entities, the method further comprises:

adopting a preset alignment strategy, and respectively executing alignment operation on each preset entity based on each standard entity in a standard database to obtain at least one aligned preset entity; the preset alignment strategy comprises at least one of a preset alignment algorithm, a first mapping list and a vector similarity algorithm, wherein the first mapping list represents at least one preset entity corresponding to each standard entity in the standard database.

6. The method of claim 1, wherein the determining a search variation set based on the search combination comprises:

when the search combination set contains a first search combination, judging whether at least one inquiry disease which has a relation with search genes in the first search combination exists in the document knowledge graph or not;

if not, determining a search GPN variation set based on the search gene; wherein, each preset entity corresponding to each first GPN variation in the search GPN variation set comprises a search gene.

7. The method of claim 1, wherein the determining a search variation set based on the search combination comprises:

determining a search GPN variation set based on a second search combination when the search combination set includes the second search combination; wherein, each preset entity corresponding to each second GPN variation in the search GPN variation set comprises a search gene and search amino acid;

when the search combination set comprises a third search combination, determining a search GPN variation set based on the third search combination; wherein, each preset entity corresponding to each third GPN variation in the search GPN variation set comprises a search gene and a search nucleotide.

8. The method of claim 1, wherein the search combination set comprises a fourth search combination of a search gene, a search amino acid, and a search transcript, or wherein the search combination set comprises a fifth search combination of a search gene, a search nucleotide, and a search transcript, and wherein the method further comprises:

when the search combination is a fourth search combination or a fifth search combination, determining a search standard variation set based on the search combination;

determining a reference ordering result corresponding to the retrieval combination based on the retrieval standard variation set by adopting a standard database search engine;

wherein each search standard variation in the search standard variation set characterizes a combination form of a preset gene, a preset amino acid, a preset transcript and a preset nucleotide.

9. The method of claim 8, wherein after determining the search standard variation set based on the search combination, the method further comprises:

when the search combination is a fifth search combination and the search standard variation set is an empty set, acquiring a target HGVS variation corresponding to the search nucleotide and the search transcript in the fifth search combination;

based on a second mapping list, adding the standard variation corresponding to the target HGVS variation to the search standard variation set as a search standard variation.
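
The fallback in claim 9 might be sketched as follows, for illustration only. How the target HGVS variation is obtained is not specified in this claim. Forming it as `transcript:nucleotide`, the dictionary used as the second mapping list, the identifier `STD_0001`, and the function name are all assumptions.

```python
def expand_with_hgvs(standard_set, transcript, nucleotide, second_mapping_list):
    """If the standard variation set is empty, try to fill it via HGVS lookup."""
    if standard_set:
        # Non-empty set: nothing to do.
        return list(standard_set)
    # Assumed construction of the target HGVS variation.
    target_hgvs = f"{transcript}:{nucleotide}"
    standard = second_mapping_list.get(target_hgvs)
    return [standard] if standard is not None else []

# Hypothetical second mapping list: HGVS variation -> standard variation.
SECOND_MAPPING_LIST = {"NM_000000.1:c.100A>G": "STD_0001"}
filled = expand_with_hgvs([], "NM_000000.1", "c.100A>G", SECOND_MAPPING_LIST)
```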

10. The method according to any one of claims 1 to 9, wherein determining a document search result corresponding to each search combination based on the reference ranking result corresponding to the search combination, respectively, comprises:

for each target variation document, inputting the matching segment corresponding to the target variation document and the search combination into a pre-trained evidence classification model to obtain the matching probability corresponding to the target variation document;

and sorting the target variation documents in descending order of matching probability to obtain the document retrieval result corresponding to the search combination.
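
The scoring and sorting steps of claim 10 can be sketched as below. This is not the claimed model: `toy_evidence_model` is a hypothetical stand-in for the pre-trained evidence classification model that scores a segment by the fraction of search terms it contains. The document structure and all names are likewise assumptions.

```python
def toy_evidence_model(segment, search_combination):
    """Stand-in scorer: fraction of search terms that occur in the segment."""
    text = segment.lower()
    hits = sum(1 for term in search_combination if term.lower() in text)
    return hits / len(search_combination)

def rank_documents(target_docs, search_combination, model=toy_evidence_model):
    """Score each document's matching segment, then sort in descending order."""
    scored = [
        {"id": doc["id"], "probability": model(doc["segment"], search_combination)}
        for doc in target_docs
    ]
    return sorted(scored, key=lambda d: d["probability"], reverse=True)

docs = [
    {"id": "doc1", "segment": "GENE_A was studied in a cohort."},
    {"id": "doc2", "segment": "GENE_A p.V600E was detected in tumor samples."},
    {"id": "doc3", "segment": "No relevant variant was reported."},
]
ranking = rank_documents(docs, ("GENE_A", "p.V600E"))
```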

11. A variant document retrieval device, comprising:

wherein the search data comprises at least a search gene, and the search combination set construction module is specifically configured to:

the reference variation determination module comprises:

a search variation set determination unit configured to determine a search variation set based on the search combination; wherein the search variation set comprises at least one search variation type, and the search variation set is a search GD variation set or a search GPN variation set;

and the retrieval weight value determining unit is used for determining at least one reference variation document and retrieval weight values respectively corresponding to the reference variation documents based on the document knowledge graph and the retrieval variation set.

12. An electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of retrieving variant documents according to any one of claims 1-10.

13. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the method of retrieving variant documents according to any one of claims 1 to 10.