CN113010746B

CN113010746B - Medical record graph sequence retrieval method and system based on sub-tree inverted index

Info

Publication number: CN113010746B
Application number: CN202110294328.9A
Authority: CN
Inventors: 王晓黎; 黄烨钒
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2023-08-29
Anticipated expiration: 2041-03-19
Also published as: CN113010746A

Abstract

The invention relates to a medical record graph sequence retrieval method and system based on a sub-tree inverted index. First, a three-layer inverted index of a medical record graph sequence database is constructed based on a sub-tree decomposition algorithm. Second, based on the sub-tree inverted index table and the size table, a sub-tree approximation query algorithm is used to obtain the sub-tree approximation table corresponding to each sub-tree structure. Then, based on the graph structure inverted index table and the sub-tree approximation table corresponding to each sub-tree structure, a graph structure approximation query algorithm is used to obtain the graph structure approximation table corresponding to each graph structure. Finally, based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure, a graph sequence approximation query algorithm is used to obtain the graph sequence approximation table corresponding to each graph sequence. By combining the three-layer inverted index with the approximate query algorithms, the invention establishes connections among multi-modal data and performs approximate case search on that basis, thereby improving the accuracy of the search.

Description

Medical record graph sequence retrieval method and system based on sub-tree inverted index

Technical Field

The invention relates to the field of medical chart sequence retrieval, in particular to a medical chart sequence retrieval method and system based on sub-tree inverted indexes.

Background

With the rapid development of information technology, data in every industry has become richer and more diverse, producing content-rich multi-modal data such as text, pictures, audio and video.

Because multi-modal data is diverse, complex and random, structured, unified management is difficult to achieve. Moreover, the data are often correlated, and the true value of the data can be realized only by mining these latent correlations. Traditional database technology usually targets data of a single modality, and data of different modalities are represented with different complex data models, such as character strings, trees, graphs, high-dimensional data and dynamic sequences. These methods cannot represent the correlations among multi-modal data and cannot meet users' comprehensive information retrieval needs. Some cross-media unified indexing techniques have been proposed that solve cross-domain queries for data with obvious semantic relevance. However, these solutions apply only to highly correlated data such as social media data, and cannot process medical and health data whose semantic relationships are ambiguous. Because the data foundation is weak, the analysis results are often of little significance and lack practicality. Therefore, how to perform effective unified modeling and indexing of multi-modal data with fuzzy semantic associations is an important scientific problem to be solved.

Furthermore, the structure and content of multi-modal medical and health data are not constant but evolve and change over time. For example, an electronic medical record often contains multiple visit records of a patient, and the data structure and content produced by each visit are usually not fixed; health information such as body temperature collected by a mobile medical platform varies even more with the user's physical state. Analyzing this dynamic nature of the data is of great importance both for predicting patient conditions and for monitoring user health. Existing medical big data analysis methods cannot describe such dynamic properties, and often require complex machine learning algorithms to analyze and predict the dynamic evolution of the data. Owing to the limitations of artificial intelligence in complex, variable and dynamic environments, the accuracy of the analysis results tends to be low. Therefore, how to design a new dynamic model that accurately describes the evolution and change of medical and health data over time is a critical scientific problem to be solved.

Disclosure of Invention

The invention aims to provide a medical record chart sequence retrieval method and system based on sub-tree inverted index so as to improve the accuracy rate of case searching.

In order to achieve the above object, the present invention provides a medical chart sequence retrieval method based on sub-tree inverted index, the method comprising:

step S1: constructing three layers of inverted indexes of a medical record chart sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;

step S2: obtaining a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, the graph structure is decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes;

step S3: giving a size table of the sub-tree structure t_q to be queried;

step S4: based on the sub-tree inverted index table and the size table, a sub-tree approximation query algorithm is adopted to obtain a sub-tree approximation table corresponding to each sub-tree structure;

step S5: based on the inverted index table of the graph structure and the subtree approximation table corresponding to each subtree structure, obtaining the graph structure approximation table corresponding to each graph structure by adopting a graph structure approximation query algorithm;

step S6: and obtaining a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.
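The nesting described in step S2 (graph sequence, graph structure, sub-tree structure, node) can be pictured with a few container types. The following Python sketch is illustrative only: the class names `SubTree`, `GraphStructure` and `GraphSequence`, the label strings, and the choice to list the root before the leaves in the node sequence are assumptions made for this example and are not defined in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubTree:
    """A sub-tree structure: a root node label plus its leaf node labels."""
    root: str
    leaves: List[str] = field(default_factory=list)

    def node_sequence(self) -> List[str]:
        # Assumed ordering: root first, then leaves.
        return [self.root] + self.leaves


@dataclass
class GraphStructure:
    """One graph structure, held as its sub-tree sequence."""
    subtrees: List[SubTree] = field(default_factory=list)


@dataclass
class GraphSequence:
    """A graph sequence: graph structures in time order."""
    graphs: List[GraphStructure] = field(default_factory=list)


# A hypothetical query with a single graph structure and two sub-trees.
query = GraphSequence(graphs=[
    GraphStructure(subtrees=[SubTree("n1", ["n2", "n3"]), SubTree("n4", ["n2"])]),
])

# Flatten the query down to node level, following the S2 hierarchy.
nodes = [n for g in query.graphs for t in g.subtrees for n in t.node_sequence()]
```

Each level of the three-layer index in step S1 is keyed by the elements one level below it in this hierarchy.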

Optionally, the step S1 specifically includes:

step S11: decomposing each medical record graph sequence into a medical record graph structure sequence, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences by taking each medical record graph structure in the medical record graph structure sequence as an index;

step S12: decomposing each medical record graph structure into a medical record sub-tree sequence, taking each medical record sub-tree structure in the medical record sub-tree sequence as an index, and establishing a graph structure inverted index table corresponding to all medical record graph structures;

step S13: and decomposing each medical record sub-tree structure into a medical record node sequence, and establishing a sub-tree inverted index table corresponding to all the medical record sub-tree structures by taking each medical record node in the medical record node sequence as an index.

Optionally, the step S4 specifically includes:

step S41: accessing the sub-tree inverted index table to obtain a sub-tree sorting table corresponding to each node;

step S42: classifying the sub-tree sorting tables according to the size table; each sub-tree sorting table smaller than the size table uses α = 2|L_q| to calculate α; each sub-tree sorting table greater than or equal to the size table uses α = -|L_q| - (T(β) - 2×τ) to calculate α; wherein α represents the approximate distance between each sub-tree structure and the sub-tree structure t_q to be queried, L_q represents the sub-tree structure t_q to be queried, T(β) represents the number of common leaf labels, and τ represents the last seen size value in the size table;

step S43: accessing the current sub-tree structure in the sub-tree sorting table, and judging whether α is greater than the maximum sub-tree approximate distance in the sub-tree approximation table; if α is greater than that maximum, stopping subsequent access and outputting the sub-tree approximation table corresponding to each sub-tree structure; if α is less than or equal to that maximum, adding the currently accessed sub-tree structure in the sub-tree sorting table to the sub-tree approximation table, and accessing the next sub-tree structure in the sub-tree sorting table, until α is greater than the maximum sub-tree approximate distance in the sub-tree approximation table; each sub-tree approximation table comprises k1 sub-tree structures, and the sub-tree structures in the sub-tree approximation table are hereinafter referred to as approximate sub-trees.
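The stopping rule of step S43 resembles a threshold-style top-k scan: candidates are visited in list order, and the scan ends as soon as the bound α at the current position exceeds the largest distance already kept. The Python sketch below is one possible reading under stated assumptions, not the patented procedure itself. The function names `alpha_large` and `top_k_scan` are hypothetical, `alpha_large` encodes the second formula of step S42 read as α = -|L_q| - (T(β) - 2×τ), and it is assumed that each scanned entry carries both its bound and an already computed distance.

```python
import heapq
from typing import Iterable, List, Tuple


def alpha_large(lq_size: int, common_leaves: int, tau: int) -> int:
    """Bound for tables at or above the query size: -|L_q| - (T(beta) - 2*tau)."""
    return -lq_size - (common_leaves - 2 * tau)


def top_k_scan(entries: Iterable[Tuple[str, float, float]], k: int) -> List[Tuple[str, float]]:
    """Scan (candidate, alpha, distance) entries in order and keep the k closest.

    alpha is the bound at the current access position (assumed precomputed);
    distance is the candidate's own approximate distance to the query.
    """
    heap: List[Tuple[float, str]] = []  # max-heap emulated with negated distances
    for cand, alpha, dist in entries:
        if len(heap) == k and alpha > -heap[0][0]:
            break  # the bound exceeds the largest kept distance: stop scanning
        if len(heap) < k:
            heapq.heappush(heap, (-dist, cand))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, cand))
    return sorted(((c, -d) for d, c in heap), key=lambda item: item[1])


# Hypothetical sorting table: (sub-tree id, alpha at that position, distance).
table = [("t1", 0, 1), ("t2", 1, 3), ("t3", 2, 2), ("t4", 4, 5), ("t5", 4, 0)]
approx_table = top_k_scan(table, k=2)
```

In this toy run the scan stops at `t4`, so `t5` is never visited; that is the intended saving of the early-termination test, and it is only safe when α is a valid lower bound for everything not yet accessed.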

Optionally, the step S5 specifically includes:

step S51: accessing the subtree approximation tables row by row, and combining and sorting the inverted index tables of the graph structure corresponding to k1 approximation subtrees of the subtree structure to obtain a sorting table of the graph structure;

step S52: calculating the sub-tree approximate distance evaluation sum Γ by Γ = Σ_{j=1}^{M} Θ_j, where M represents the total number of sub-tree sorting tables and Θ_j represents the sub-tree approximate distance at the current access position in the j-th sub-tree sorting table;

step S53: accessing the current graph structure in the graph structure sorting table, and judging whether the sub-tree approximate distance evaluation sum Γ is greater than the maximum graph structure approximate distance in the graph structure approximation table; if Γ is greater than that maximum, stopping subsequent access and outputting the graph structure approximation table corresponding to each graph structure; if Γ is less than or equal to that maximum, adding the currently accessed graph structure to the graph structure approximation table, and accessing the next graph structure in the graph structure sorting table, until Γ is greater than the maximum graph structure approximate distance in the graph structure approximation table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are hereinafter referred to as approximate graph structures.
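Steps S52 and S53 combine one value per sorting table into a single evaluation sum and compare it with the worst result held so far. The sketch below assumes Γ is the plain sum of the Θ_j values at the current access positions, which is how the definitions of M and Θ_j read; the function names `distance_sum` and `should_stop` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def distance_sum(sorting_tables: Sequence[Sequence[float]], positions: Sequence[int]) -> float:
    """Sum the approximate distance at the current access position of each table."""
    return sum(table[pos] for table, pos in zip(sorting_tables, positions))


def should_stop(gamma: float, kept_distances: List[float], k: int) -> bool:
    """Stop once k results are held and gamma exceeds the largest kept distance."""
    return len(kept_distances) >= k and gamma > max(kept_distances)


# Two hypothetical sub-tree sorting tables (distances in access order).
tables = [[0, 1, 4], [1, 2, 2]]
gamma = distance_sum(tables, positions=[1, 2])  # 1 + 2
```

The same pattern applies one level up in steps S62 and S63, with K and Ω_k in place of Γ and Θ_j.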

Optionally, the step S6 specifically includes:

step S61: accessing the graph structure approximation table row by row, combining and sorting the graph sequence inverted index tables corresponding to the k2 approximation graph structures of the graph structure to obtain a graph sequence sorting table;

step S62: calculating the graph structure approximate distance evaluation sum K by K = Σ_{k=1}^{N} Ω_k, where N represents the total number of graph structure sorting tables and Ω_k represents the graph structure approximate distance at the current access position in the k-th graph structure sorting table;

step S63: accessing the current graph sequence in the graph sequence sorting table, and judging whether the graph structure approximate distance evaluation sum K is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if K is greater than that maximum, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if K is less than or equal to that maximum, adding the currently accessed graph sequence to the graph sequence approximation table, and accessing the next graph sequence in the graph sequence sorting table, until K is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table includes k3 graph sequences, and the graph sequences in the graph sequence approximation table are hereinafter referred to as approximate graph sequences.

The invention also provides a medical record chart sequence retrieval system based on the sub-tree inverted index, which comprises:

the three-layer inverted index construction module is used for constructing three-layer inverted indexes of the medical record chart sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;

the acquisition module is used for acquiring a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, each graph structure is decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes;

the giving module is used for giving a size table of the sub-tree structure t_q to be queried;

the subtree approximation table query module is used for acquiring subtree approximation tables corresponding to all subtree structures by adopting a subtree approximation query algorithm based on the subtree inverted index table and the size table;

the graph structure approximation table query module is used for obtaining the graph structure approximation table corresponding to each graph structure by adopting a graph structure approximation query algorithm based on the graph structure inverted index table and the sub-tree approximation table corresponding to each sub-tree structure;

and the graph sequence approximation table query module is used for obtaining the graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.

Optionally, the three-layer inverted index building module specifically includes:

the graph sequence inverted index table construction unit is used for decomposing each medical record graph sequence into a medical record graph structure sequence, taking each medical record graph structure in the medical record graph structure sequence as an index, and establishing the graph sequence inverted index table corresponding to all medical record graph sequences;

the graph structure inverted index table construction unit is used for decomposing each medical record graph structure into a medical record sub-tree sequence, taking each medical record sub-tree structure in the medical record sub-tree sequence as an index, and establishing the graph structure inverted index table corresponding to all medical record graph structures;

the sub tree inverted index table construction unit is used for decomposing each medical record sub tree structure into a medical record node sequence, taking each medical record node in the medical record node sequence as an index, and establishing sub tree inverted index tables corresponding to all medical record sub tree structures.

Optionally, the subtree approximation table query module specifically includes:

the sub-tree sorting table determining unit is used for accessing the sub-tree inverted index table to obtain a sub-tree sorting table corresponding to each node;

the approximate distance calculation unit is used for classifying the sub-tree sorting tables according to the size table, wherein each sub-tree sorting table smaller than the size table uses α = 2|L_q| to calculate α, and each sub-tree sorting table greater than or equal to the size table uses α = -|L_q| - (T(β) - 2×τ) to calculate α; wherein α represents the approximate distance between each sub-tree structure and the sub-tree structure t_q to be queried, L_q represents the sub-tree structure t_q to be queried, T(β) represents the number of common leaf labels, and τ represents the last seen size value in the size table;

the first judging unit is used for accessing the current sub-tree structure in the sub-tree sorting table and judging whether α is greater than the maximum sub-tree approximate distance in the sub-tree approximation table; if α is greater than that maximum, stopping subsequent access and outputting the sub-tree approximation table corresponding to each sub-tree structure; if α is less than or equal to that maximum, adding the currently accessed sub-tree structure in the sub-tree sorting table to the sub-tree approximation table, and accessing the next sub-tree structure in the sub-tree sorting table, until α is greater than the maximum sub-tree approximate distance in the sub-tree approximation table; each sub-tree approximation table comprises k1 sub-tree structures, and the sub-tree structures in the sub-tree approximation table are hereinafter referred to as approximate sub-trees.

Optionally, the graph structure approximation table query module specifically includes:

the graph structure sorting table determining unit is used for accessing the sub-tree approximation tables row by row, and combining and sorting the graph structure inverted index tables corresponding to the k1 approximate sub-trees of the sub-tree structure to obtain a graph structure sorting table;

the sub-tree approximate distance evaluation sum determining unit is used for calculating the sub-tree approximate distance evaluation sum Γ by Γ = Σ_{j=1}^{M} Θ_j, where M represents the total number of sub-tree sorting tables and Θ_j represents the sub-tree approximate distance at the current access position in the j-th sub-tree sorting table;

the second judging unit is used for accessing the current graph structure in the graph structure sorting table and judging whether the sub-tree approximate distance evaluation sum Γ is greater than the maximum graph structure approximate distance in the graph structure approximation table; if Γ is greater than that maximum, stopping subsequent access and outputting the graph structure approximation table corresponding to each graph structure; if Γ is less than or equal to that maximum, adding the currently accessed graph structure to the graph structure approximation table, and accessing the next graph structure in the graph structure sorting table, until Γ is greater than the maximum graph structure approximate distance in the graph structure approximation table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are hereinafter referred to as approximate graph structures.

Optionally, the graph sequence approximation table query module specifically includes:

the graph sequence sorting table determining unit is used for accessing the graph structure approximation table row by row, and combining and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain a graph sequence sorting table;

the graph structure approximate distance evaluation sum determining unit is used for calculating the graph structure approximate distance evaluation sum K by K = Σ_{k=1}^{N} Ω_k, where N represents the total number of graph structure sorting tables and Ω_k represents the graph structure approximate distance at the current access position in the k-th graph structure sorting table;

the third judging unit is used for accessing the current graph sequence in the graph sequence sorting table and judging whether the graph structure approximate distance evaluation sum K is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if K is greater than that maximum, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if K is less than or equal to that maximum, adding the currently accessed graph sequence to the graph sequence approximation table, and accessing the next graph sequence in the graph sequence sorting table, until K is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table includes k3 graph sequences, and the graph sequences in the graph sequence approximation table are hereinafter referred to as approximate graph sequences.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention relates to a medical record graph sequence retrieval method and system based on a sub-tree inverted index. First, a three-layer inverted index of a medical record graph sequence database is constructed based on a sub-tree decomposition algorithm. Second, based on the sub-tree inverted index table and the size table, a sub-tree approximation query algorithm is used to obtain the sub-tree approximation table corresponding to each sub-tree structure. Then, based on the graph structure inverted index table and the sub-tree approximation table corresponding to each sub-tree structure, a graph structure approximation query algorithm is used to obtain the graph structure approximation table corresponding to each graph structure. Finally, based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure, a graph sequence approximation query algorithm is used to obtain the graph sequence approximation table corresponding to each graph sequence. By combining the three-layer inverted index with the approximate query algorithms, the invention establishes connections among multi-modal data and performs approximate case search on that basis, thereby improving the accuracy of case search.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a diagram illustrating a patient medical record chart sequence according to an embodiment of the present invention;

FIG. 2 is a tree diagram of an embodiment of the present invention;

FIG. 3 is a diagram showing an example of the calculation of the subtree approximate distance according to the embodiment of the invention;

FIG. 4 is a diagram illustrating an example of the calculation of the graph structure approximate distance according to an embodiment of the present invention;

FIG. 5 is a flowchart showing a complete example of an embodiment of the present invention;

FIG. 6 is a flowchart of a medical chart sequence retrieval method based on sub-tree inverted indexes according to an embodiment of the invention;

FIG. 7 is a chart of the result of the run-time of the medical record approximate search algorithm according to the embodiment of the invention;

FIG. 8 is a graph of the accuracy results of the medical record approximate search algorithm according to an embodiment of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

As shown in fig. 5-6, the present invention provides a medical chart sequence retrieval method based on sub-tree inverted index, which comprises:

step S3: giving a size table of the sub-tree structure t_q to be queried;

The steps are discussed in detail below:

step S1: constructing three layers of inverted indexes of a medical record chart sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table. The medical chart sequence database comprises a plurality of medical chart sequences;

step S11: decomposing each medical chart sequence into a medical chart structure sequence, taking each medical chart structure in the medical chart structure sequence as an index, and establishing a chart sequence inverted index table corresponding to all medical chart sequences, wherein the specific implementation process is as follows:

Given the medical record graph sequence Gⁱ = {g₁, g₂, …, g_{|Gⁱ|}} of patient i, g_j is the medical record graph structure of patient i at the j-th admission, and |Gⁱ| represents the number of medical record graph structures in the medical record graph sequence Gⁱ (i.e., the number of admissions of patient i). All medical record graph structures in the medical record graph sequence are ordered by the patient's actual admission time (e.g., the admission time of g₁ is earlier than that of g₂). As this representation shows, a medical record graph sequence contains the medical record graph structures corresponding to a patient's series of admission records, each representing one admission, so the medical record graph sequence can be decomposed directly into a sequence of graph structures. The medical record graph structures are used as indexes, and under each medical record graph structure index is an inverted list that stores information about graph sequences. Each item in the inverted list stores the identifier of a graph sequence and the frequency with which the corresponding graph structure appears in that graph sequence.

As shown in FIG. 1, a medical record graph sequence database containing two patients is given, denoted {G¹, G²}. The medical record graph sequence G¹ comprises two admission records, represented as medical record graph structures g₁ and g₂; the medical record graph sequence G² also comprises two admission records, represented as medical record graph structures g₁ and g₃. Table 1 shows the graph sequence inverted index constructed from this medical record graph sequence database, with the medical record graph structures g₁, g₂ and g₃ as index entries. The inverted list of g₁ contains two elements, {G¹, 1} and {G², 1}, indicating that G¹ contains one medical record graph structure g₁ and that G² contains one medical record graph structure g₁. The inverted list of g₂ contains one element, {G¹, 1}, indicating that G¹ contains one medical record graph structure g₂. The inverted list of g₃ contains one element, {G², 1}, indicating that G² contains one medical record graph structure g₃.

Table 1: Graph sequence inverted index
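The FIG. 1 example can be reproduced with a small dictionary-based inverted index. This is a minimal sketch assuming graph structures are identified by plain string ids; the function name `build_inverted_index` and the ids `"G1"`, `"g1"` and so on are hypothetical stand-ins for G¹, g₁ and the other symbols.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_inverted_index(sequences: Dict[str, List[str]]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each graph structure id to a list of (graph sequence id, frequency)."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for seq_id, graphs in sequences.items():
        # Counter gives how often each graph structure occurs in this sequence.
        for graph_id, freq in Counter(graphs).items():
            index[graph_id].append((seq_id, freq))
    return dict(index)


# Two patients: G1 = (g1, g2) and G2 = (g1, g3), as in the FIG. 1 description.
database = {"G1": ["g1", "g2"], "G2": ["g1", "g3"]}
index = build_inverted_index(database)
```

The resulting `index["g1"]` holds one entry per graph sequence that contains g₁, matching the two-element inverted list described for Table 1.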

Step S12: decomposing each medical record graph structure into a medical record sub-tree sequence, taking each medical record sub-tree structure in the medical record sub-tree sequence as an index, and establishing a graph structure inverted index table corresponding to all medical record graph structures, wherein the specific implementation process is as follows:

Given a medical record graph structure g_j in the medical record graph sequence of patient i, g_j is decomposed into a medical record sub-tree sequence {t₁, t₂, …, t_{|g_j|}}, where t_m represents a medical record sub-tree structure and |g_j| represents the number of medical record nodes in the medical record graph structure g_j (i.e., the number of sub-trees resulting from decomposing g_j).

Taking one medical record graph structure as an example, the decomposition traverses each medical record node in the node set of the graph structure and looks for medical record nodes whose degree is not 0, where the degree of a medical record node is the number of edges pointing from that node to other medical record nodes. Each such node, together with the nodes its edges point to, forms a medical record sub-tree structure. An inverted index is then established for all graph structures using the sub-trees and inverted lists. Each item in the inverted list contains the identifier of a graph structure and the frequency with which the corresponding sub-tree appears in that graph structure, and all lists are ordered by increasing graph structure size.
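The decomposition rule above (one sub-tree per node that has outgoing edges) can be sketched as follows. The representation of a graph as a node list plus directed `(source, target)` pairs, the function name `decompose`, and the sorting of leaf labels are assumptions made for this example; the edges for g₁ follow the walkthrough of FIG. 1, where n₁ points to n₂ and n₃ and n₄ points to n₂.

```python
from typing import Dict, List, Tuple

SubTree = Tuple[str, Tuple[str, ...]]  # (root label, leaf labels)


def decompose(nodes: List[str], edges: List[Tuple[str, str]]) -> List[SubTree]:
    """Return one sub-tree per node that has at least one outgoing edge."""
    children: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
    # Nodes with out-degree 0 are skipped, as in the traversal described above.
    return [(n, tuple(sorted(children[n]))) for n in nodes if children[n]]


g1_subtrees = decompose(
    ["n1", "n2", "n3", "n4"],
    [("n1", "n2"), ("n1", "n3"), ("n4", "n2")],
)
```

Representing each sub-tree as a hashable tuple makes it straightforward to skip a sub-tree that has already been produced by another graph structure, for example by collecting results in a set.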

All medical record graph structures in FIG. 1, namely g₁, g₂ and g₃, are traversed in turn. First, each medical record node in the node list {n₁, n₂, n₃, n₄} of the medical record graph structure g₁ is traversed. Node n₁ is checked first: two edges {e₁, e₂} point from n₁ to other medical record nodes, so the medical record nodes related to {e₁, e₂} are extracted to form the medical record sub-tree structure t₁ in FIG. 2. Node n₂ is checked next; because n₂ has no edges pointing to other medical record nodes, n₂ is skipped. Likewise, n₃ has no edges pointing to other medical record nodes and is skipped. Finally, n₄ is checked: n₄ has an edge e₃ pointing to medical record node n₂, so the medical record nodes associated with e₃ are extracted to form the medical record sub-tree structure t₂ in FIG. 2.

Next, each medical record node in the node list {n₁, n₂, n₃, n₅} of g₂ is traversed. Node n₁ is checked first: two edges {e₁, e₂} point from n₁ to other medical record nodes, and the medical record nodes related to {e₁, e₂} form a sub-tree that is identical to t₁ in FIG. 2, so it is not added again. Node n₂ is checked next; because n₂ has no edges pointing to other medical record nodes, n₂ is skipped. Likewise, n₃ has no edges pointing to other medical record nodes and is skipped. Finally, n₅ is checked: n₅ has two edges pointing to medical record node n₂ (e₃), so the medical record nodes associated with e₃ are extracted to form the medical record sub-tree structure t₃ in FIG. 2.

Finally, each medical record node in the node list {n₁, n₂, n₃, n₅} owned by the medical record graph structure g₃ is traversed. Starting with n₁: two edges {e₁, e₂} point from n₁ to other medical record nodes, and the medical record nodes related to {e₁, e₂} form a subtree that is identical to t₁ in FIG. 2, so it is not added again. Next, n₂ is checked; since n₂ has no edges pointing to other medical record nodes, n₂ is skipped. Likewise, n₃ has no edges pointing to other medical record nodes, so n₃ is skipped. Finally, n₅ is checked; n₅ has an edge e₃ pointing to medical record node n₂, so the medical record nodes related to e₃ are extracted to form the medical record sub-tree structure t₄ in FIG. 2.
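The decomposition walked through above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: each graph structure is assumed to be given as a list of directed (source, target) edges, node names double as labels, and the names `decompose` and `catalog` are hypothetical.

```python
from collections import defaultdict

def decompose(edges):
    """Split a directed graph into one-level subtrees (root, leaves)."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    # Nodes without outgoing edges never become keys, so they are skipped.
    return [(root, tuple(sorted(leaves))) for root, leaves in children.items()]

# Edge lists assumed from the walkthrough of g1, g2 and g3.
g1 = [("n1", "n2"), ("n1", "n3"), ("n4", "n2")]
g2 = [("n1", "n2"), ("n1", "n3"), ("n5", "n2"), ("n5", "n2")]
g3 = [("n1", "n2"), ("n1", "n3"), ("n5", "n2")]

catalog = {}  # distinct subtree -> identifier t1, t2, ...
for graph in (g1, g2, g3):
    for subtree in decompose(graph):
        # A subtree that was already seen keeps its identifier (no repeated addition).
        catalog.setdefault(subtree, f"t{len(catalog) + 1}")

for subtree, tid in catalog.items():
    print(tid, subtree)
```

Under these assumptions the catalog ends up with four distinct subtrees, matching t₁ to t₄ of the walkthrough: the subtree rooted at n₁ is registered once even though it occurs in all three graph structures.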

All the medical record graph structures g₁, g₂ and g₃ obtained from the medical record graph sequence in FIG. 1 can be split into the four medical record sub-tree structures t₁, t₂, t₃ and t₄ in FIG. 2. Table 2 shows the graph structure inverted index constructed on the basis of the medical record sub-tree structures, where freq is the frequency of the corresponding sub-tree.

Table 2 structural inverted index
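The shape of a graph structure inverted index of this kind can be sketched in Python. The sketch assumes each graph structure has already been decomposed into subtree identifiers, and it takes the node count as the graph structure size, which the text does not specify; `build_graph_index`, `graph_subtrees` and `graph_size` are hypothetical names. It illustrates the data structure only and does not claim to reproduce the contents of Table 2.

```python
from collections import Counter, defaultdict

def build_graph_index(graph_subtrees, graph_size):
    """Map each subtree id to a list of (graph id, frequency) entries."""
    index = defaultdict(list)
    for gid, subtrees in graph_subtrees.items():
        for tid, freq in Counter(subtrees).items():
            index[tid].append((gid, freq))
    for postings in index.values():
        # Lists are ordered by increasing graph structure size (ties by id).
        postings.sort(key=lambda entry: (graph_size[entry[0]], entry[0]))
    return dict(index)

# Subtree sequences assumed from the decomposition of g1, g2 and g3.
graph_subtrees = {"g1": ["t1", "t2"], "g2": ["t1", "t3"], "g3": ["t1", "t4"]}
graph_size = {"g1": 4, "g2": 4, "g3": 4}  # assumption: size = number of nodes

index = build_graph_index(graph_subtrees, graph_size)
for tid in sorted(index):
    print(tid, index[tid])
```

Keeping each list sorted by size is what later allows a query to scan outward from graph structures of similar size and stop early.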

The sub-tree inverted index consists of two main parts: an index of all the distinct medical record nodes, and an inverted list under each medical record node. Each entry in an inverted list contains an identifier of a sub-tree and the frequency with which the corresponding medical record node appears in that sub-tree. All lists are sorted in order of increasing subtree size.

For the medical record sub-tree structures t₁, t₂, t₃ and t₄ in FIG. 2, Table 3 shows the sub-tree inverted index based on medical record nodes, in which the medical record nodes n₁, …, n₁₀ serve as the index entries. To build this index, each medical record sub-tree structure is further broken down into units (i.e., vertices and edges), and the index is built as inverted lists. The index again contains two components: a label index, arranged in increasing alphabetical order of the label names of the medical record nodes (for example, if the label of n₁ is diabetes and the label of n₂ is influenza, then n₂ is arranged before n₁), and an inverted list under each label, recording the subtree identifiers and the frequency of the corresponding label. The entries in each list are first grouped by leaf size and then sorted by decreasing frequency within each group. A leaf node is a medical record node that has no edges pointing to other medical record nodes. For example, the first column of Table 3 is the label {t₁, t₂, t₃, t₄} of each medical record sub-tree structure, and the second column is the number of leaf nodes owned by the medical record sub-tree structure: in FIG. 2, t₁ has 2 leaf nodes {n₂, n₃}, t₂ has 1 leaf node n₂, t₃ has 2 leaf nodes {n₂, n₂}, and t₄ has 1 leaf node n₂. Column 1 in Table 3 is the sub-tree label and column 2 is the number of occurrences of medical record node n₂ in the subtree; since n₂ occurs most frequently in subtree t₃, t₃ is located in the first row of the list for n₂.

TABLE 3 sub-tree inverted index
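A node-level inverted index of this kind can be sketched in Python as follows. Several points are assumptions rather than statements of the source: only leaf labels are indexed, node names stand in for labels, and groups with more leaves are placed first (the text says entries are grouped by leaf size but does not give the group order; this choice is the one that puts t₃ first in the list for n₂). The name `build_subtree_index` is hypothetical.

```python
from collections import Counter, defaultdict

# Subtrees as (root, leaves), taken from the description of t1 to t4.
subtrees = {
    "t1": ("n1", ("n2", "n3")),
    "t2": ("n4", ("n2",)),
    "t3": ("n5", ("n2", "n2")),
    "t4": ("n5", ("n2",)),
}

def build_subtree_index(subtrees):
    """Map each leaf label to entries (subtree id, leaf size, frequency)."""
    index = defaultdict(list)
    for tid, (_root, leaves) in subtrees.items():
        for label, freq in Counter(leaves).items():
            index[label].append((tid, len(leaves), freq))
    for postings in index.values():
        # Group by leaf size (larger first, an assumption),
        # then sort by decreasing frequency inside each group.
        postings.sort(key=lambda e: (-e[1], -e[2], e[0]))
    # The label index itself is kept in increasing label order.
    return dict(sorted(index.items()))

index = build_subtree_index(subtrees)
for label, postings in index.items():
    print(label, postings)
```

With this ordering the list for n₂ starts with t₃, the subtree in which n₂ occurs twice.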

Step S2: A graph sequence to be queried is obtained. The graph sequence comprises a plurality of graph structures; each graph structure is decomposed into a sub-tree sequence comprising a plurality of sub-tree structures, and each sub-tree structure is decomposed into a node sequence comprising a plurality of nodes.

The subtree structure is represented by a triplet {r, L, l}, where r is the root node of the subtree structure, L is the set of leaf nodes, and l is the label function. For example, the root node of subtree structure t₃ in FIG. 3 is n₅, whose label is diabetes; the root node of subtree structure t₄ is also n₅, whose label is likewise diabetes.

The subtree approximate distance between all the subtree structures is calculated; the specific formulas are as follows:

λ(t₁, t₂)＝T(r₁, r₂)+d(L₁, L₂)

d(L₁, L₂)＝||L₁|-|L₂||+M(L₁, L₂)

M(L₁, L₂)＝max{|Ψ_L₁|, |Ψ_L₂|}-|Ψ_L₁ ∩ Ψ_L₂|

where λ(t₁, t₂) represents the subtree approximate distance between subtree structure t₁ and subtree structure t₂. The value of T(r₁, r₂) depends on whether l(r₁) and l(r₂) are equal: if l(r₁)＝l(r₂), i.e. the labels of the two root nodes agree, then T(r₁, r₂)＝0, otherwise T(r₁, r₂)＝1. Here l denotes the label function, r₁ is the root node of subtree structure t₁, r₂ is the root node of subtree structure t₂, L₁ is the leaf node set of subtree structure t₁, L₂ is the leaf node set of subtree structure t₂, d(L₁, L₂) represents the distance between the two sets L₁ and L₂, Ψ_L₁ represents the set of labels corresponding to the leaf nodes of subtree t₁, and Ψ_L₂ represents the set of labels corresponding to the leaf nodes of subtree t₂.

As shown in FIG. 3, the subtree approximate distance between subtree structure t₃ and subtree structure t₄ is calculated. For T(r₃, r₄), obviously l(r₃)＝l(r₄)＝n₅, so T(r₃, r₄)＝0. The leaf node list owned by subtree structure t₃ is L₃＝{n₂, n₂}, so the list size |L₃|＝2, while |L₄|＝1, and M(L₃, L₄)＝2-1＝1. Therefore, λ(t₃, t₄)＝0+|2-1|+2-1＝2.
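The calculation above can be checked with a short Python sketch. It assumes that M(L₁, L₂) is the larger leaf count minus the number of leaf labels the two subtrees share (counted with multiplicity), which is consistent with the worked value 0+|2-1|+2-1＝2; the function name `subtree_distance` and the (root, leaves) tuple layout are hypothetical.

```python
from collections import Counter

def subtree_distance(t1, t2):
    """Subtree approximate distance: T(r1, r2) + ||L1| - |L2|| + M(L1, L2)."""
    (root1, leaves1), (root2, leaves2) = t1, t2
    root_cost = 0 if root1 == root2 else 1                # T(r1, r2)
    # Counter & Counter keeps the minimum count per label (multiset intersection).
    common = sum((Counter(leaves1) & Counter(leaves2)).values())
    mismatch = max(len(leaves1), len(leaves2)) - common   # M(L1, L2), as assumed
    return root_cost + abs(len(leaves1) - len(leaves2)) + mismatch

t3 = ("n5", ["n2", "n2"])
t4 = ("n5", ["n2"])
print(subtree_distance(t3, t4))  # 2
```

The roots agree, the leaf counts differ by one, and one of the two n₂ leaves of t₃ has no counterpart in t₄, which gives 0 + 1 + 1 = 2.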

Step S3: A size table of the subtree structure t_q to be queried is given.

Table 4 size table

Step S4: based on the sub-tree inverted index table and the size table, a sub-tree approximation query algorithm is adopted to obtain a sub-tree approximation table corresponding to each sub-tree structure, and the method specifically comprises the following steps:

Step S42: The subtree sorted lists are classified according to the size table. For each subtree sorted list smaller than the size table, α is calculated using α＝2×|L_q|; for each subtree sorted list greater than or equal to the size table, α is calculated using α＝-|L_q|-(t(β)-2×τ); where α represents the approximate distance between each subtree structure and the subtree structure t_q to be queried, L_q represents the leaf node set of the subtree structure t_q to be queried, t(β) represents the number of common leaf labels, and τ represents the last seen size value in the size table.

Assuming that t_q has m frequencies {f₁, f₂, …, f_m}, the specific formula for calculating the number of common leaf labels is as follows:

where β_j represents the frequency of subtree structure t_i in the j-th subtree sorted list of the subtree structure t_q to be queried: if the subtree structure t_i does not appear in the subtree sorted list, β_j＝0; if the subtree structure t_i appears in the subtree sorted list, β_j＝1; and f_j is the j-th frequency of the subtree structure t_q to be queried.

By using the subtree approximate query algorithm, the invention can obtain the subtree structures that are highly similar to the subtree to be queried, i.e. the subtree approximation table.

The specific formulas for calculating the subtree approximate distance are as follows:

λ(t_q, t_i)＝T(r₁, r₂)+d(L₁, L₂)

d(L₁, L₂)＝||L₁|-|L₂||+M(L₁, L₂)

M(L₁, L₂)＝max{|Ψ_L₁|, |Ψ_L₂|}-|Ψ_L₁ ∩ Ψ_L₂|

where λ(t_q, t_i) represents the subtree approximate distance between the subtree structure t_q to be queried and subtree structure t_i. The value of T(r₁, r₂) depends on whether l(r₁) and l(r₂) are equal: if l(r₁)＝l(r₂), i.e. the labels of the two root nodes agree, then T(r₁, r₂)＝0, otherwise T(r₁, r₂)＝1. Here l denotes the label function, r₁ is the root node of the subtree structure t_q to be queried, r₂ is the root node of subtree structure t_i, L₁ is the leaf node set of the subtree structure t_q to be queried, L₂ is the leaf node set of subtree structure t_i, d(L₁, L₂) represents the distance between the two sets L₁ and L₂, Ψ_L₁ represents the set of labels corresponding to the leaf nodes of the subtree structure t_q to be queried, and Ψ_L₂ represents the set of labels corresponding to the leaf nodes of subtree structure t_i.

In FIG. 4, the matrix on the left is the subtree distance matrix between the subtree sequences T(g₁) and T(g₂); the gray cells represent the best match between subtree sequence T(g₁) and subtree sequence T(g₂), i.e. μ(g₁, g₂)＝0+0+9＝9. For clarity, g₁ and g₂ are represented on the right by their subtree sequences, and the best match is indicated by solid arrows.

Step S5: based on the inverted index table of the graph structure and the subtree approximation table corresponding to each subtree structure, the graph structure approximation table corresponding to each graph structure is obtained by adopting a graph structure approximation query algorithm, and the method specifically comprises the following steps:

Step S52: The subtree approximate distance evaluation sum Γ is calculated using Γ＝Σ_{j=1}^{M} Θ_j, where M represents the total number of subtree sorted lists and Θ_j represents the subtree approximate distance at the current access location in the j-th subtree sorted list.

The specific formula for calculating the graph structure approximate distance is as follows:

μ(g_q, g_i)＝min_P Σ_{t_i∈T(g_i)} λ(t_i, P(t_i))

where P represents a bijection T(g_q)→T(g_i) between the sub-tree sequences, the subtree structure t_i belongs to the sub-tree sequence T(g_i), λ(t_i, P(t_i)) is the approximate distance between the two subtree structures t_i and P(t_i), and P(t_i) is the subtree structure in g_q that is aligned with the subtree structure t_i.
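The minimisation over alignments P can be illustrated with a brute-force Python sketch. It is only suitable for very small subtree sequences (a production system would more likely use an assignment solver such as the Hungarian algorithm), and it rests on assumptions not stated in the text: sequences of unequal length are padded with an empty placeholder subtree `EMPTY`, and the subtree distance takes M(L₁, L₂) as the larger leaf count minus the number of shared leaf labels. The names `subtree_distance` and `graph_distance` are hypothetical.

```python
from collections import Counter
from itertools import permutations

EMPTY = (None, ())  # hypothetical placeholder used to pad the shorter sequence

def subtree_distance(t1, t2):
    """T(r1, r2) + ||L1| - |L2|| + M(L1, L2), with M as assumed in the lead-in."""
    (root1, leaves1), (root2, leaves2) = t1, t2
    root_cost = 0 if root1 == root2 else 1
    common = sum((Counter(leaves1) & Counter(leaves2)).values())
    mismatch = max(len(leaves1), len(leaves2)) - common
    return root_cost + abs(len(leaves1) - len(leaves2)) + mismatch

def graph_distance(seq_q, seq_i):
    """Minimum total subtree distance over all one-to-one alignments."""
    n = max(len(seq_q), len(seq_i))
    seq_q = list(seq_q) + [EMPTY] * (n - len(seq_q))
    seq_i = list(seq_i) + [EMPTY] * (n - len(seq_i))
    return min(
        sum(subtree_distance(t, seq_q[p]) for t, p in zip(seq_i, perm))
        for perm in permutations(range(n))
    )

t1 = ("n1", ("n2", "n3"))
t3 = ("n5", ("n2", "n2"))
t4 = ("n5", ("n2",))
print(graph_distance([t1, t3], [t1, t4]))  # 2
```

For the two sequences shown, aligning t₁ with t₁ and t₃ with t₄ costs 0 + 2, while the crossed alignment costs 5, so the minimum is 2.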

The lower bound of the graph structure approximate distance μ(g_q, g_i) is then calculated from the distance of the first subtree structure of the graph structure g_i found in the j-th subtree sorted list; κ(g_q, g_i) denotes the lower bound of the graph structure approximate distance μ(g_q, g_i), and the specific formula uses the following quantities:

S_j represents the local minimum subtree approximate distance of the graph structure g_i under the j-th subtree sorted list of the graph structure g_q to be queried, taken over the subtree approximate distances of the graph structure g_i that appear below the j-th list, and e_x is the x-th subtree approximate distance of the graph structure g_i.

If κ(g_q, g_i) is greater than the distances of the current top-k results, g_i can be safely filtered out.
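The filtering rule can be sketched in Python as a generic top-k loop with lower-bound pruning. This is an illustration of the pruning idea only, not the patent's query algorithm: `lower_bound` and `exact_distance` are hypothetical callables standing in for κ and μ, and the sample numbers are invented for the demonstration.

```python
import heapq

def top_k_with_filter(candidates, k, lower_bound, exact_distance):
    """Return the k candidates with the smallest exact distance."""
    best = []  # max-heap through negated distances: (-distance, id)
    for cid in candidates:
        # Skip the expensive computation when the bound already
        # exceeds the k-th best distance found so far.
        if len(best) == k and lower_bound(cid) > -best[0][0]:
            continue
        d = exact_distance(cid)
        if len(best) < k:
            heapq.heappush(best, (-d, cid))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, cid))
    return sorted((-neg, cid) for neg, cid in best)

# Invented demonstration values; every lower bound is <= the exact distance.
exact = {"g1": 5, "g2": 1, "g3": 9, "g4": 2}
lower = {"g1": 3, "g2": 1, "g3": 6, "g4": 0}
evaluated = []

def exact_distance(gid):
    evaluated.append(gid)
    return exact[gid]

result = top_k_with_filter(["g1", "g2", "g3", "g4"], 2, lower.get, exact_distance)
print(result)     # [(1, 'g2'), (2, 'g4')]
print(evaluated)  # ['g1', 'g2', 'g4']
```

In the demonstration g3 is never evaluated exactly, because its lower bound 6 already exceeds the second-best distance 5 held at that point. The pruning is safe only when the bound never overestimates the true distance.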

Step S6: based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure, the graph sequence approximation table corresponding to each graph sequence is obtained by adopting a graph sequence approximation query algorithm, and the method specifically comprises the following steps:

Step S62: The graph structure approximate distance evaluation sum K is calculated using K＝Σ_{k=1}^{N} ω_k, where N represents the total number of graph structure sorted lists and ω_k represents the graph structure approximate distance at the current access location in the k-th graph structure sorted list.

Dynamic time warping is used to measure the similarity between two graph sequences, which is expressed as a graph alignment distance. The graph alignment distance is defined as follows: given two graph sequences G¹ and G² with their graph structure sets, and a bijection P: G¹→G², the alignment distance between G¹ and G² is ω(G¹, G²). The specific formula for calculating the graph sequence alignment distance is as follows:

ω(G^q, G^i)＝min_P Σ_{g∈G^i} μ(g, P(g))

where ω(G^q, G^i) represents the graph sequence alignment distance between graph sequence G^q and graph sequence G^i, P represents a bijection G^q→G^i, g is one of the graph structures into which the graph sequence G^i is segmented, μ(g, P(g)) is the approximate distance between the two graph structures g and P(g), and P(g) is the graph structure in G^q that is aligned with g.

Computing ω(G¹, G²) corresponds to solving a sequence alignment problem, which is usually solved with a dynamic time warping algorithm whose goal is to find the minimum-weight matching in a given cost matrix.
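A textbook dynamic time warping recurrence can be sketched in Python as follows. It is a generic illustration, not the patent's exact procedure: `dist` stands in for the graph structure approximate distance, the integer sequences are placeholders for graph structures, and standard dynamic time warping lets one element align with several elements of the other sequence, which is looser than a strict one-to-one alignment.

```python
def dtw_distance(seq_q, seq_i, dist):
    """Classic dynamic time warping cost between two sequences."""
    n, m = len(seq_q), len(seq_i)
    inf = float("inf")
    # cost[a][b] = best alignment cost of the first a and first b elements.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            step = dist(seq_q[a - 1], seq_i[b - 1])
            cost[a][b] = step + min(
                cost[a - 1][b],      # seq_i[b-1] absorbs another element of seq_q
                cost[a][b - 1],      # seq_q[a-1] absorbs another element of seq_i
                cost[a - 1][b - 1],  # both sequences advance
            )
    return cost[n][m]

# Integers stand in for graph structures; abs difference stands in for the distance.
toy_dist = lambda x, y: abs(x - y)
print(dtw_distance([1, 2, 3], [1, 2, 2, 3], toy_dist))  # 0.0
print(dtw_distance([1, 3], [2], toy_dist))              # 2.0
```

The table is filled in O(n·m) distance evaluations, so the cost of the whole alignment is dominated by how expensive each pairwise graph structure distance is.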

The lower bound κ(G^q, G^i) of the graph sequence alignment distance ω(G^q, G^i) is calculated from the distance at which the graph sequence G^i is first found in the j-th sorted list. Obviously, if κ(G^q, G^i) is greater than the distances of the current top-k results, the graph sequence G^i can be safely filtered out.

A three-layer inverted index construction module, used for constructing the three-layer inverted index of the medical record graph sequence database based on a subtree decomposition algorithm; the three-layer inverted index comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table.

An acquisition module, used for acquiring a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, each graph structure is decomposed into a sub-tree sequence comprising a plurality of sub-tree structures, and each sub-tree structure is decomposed into a node sequence comprising a plurality of nodes.

A given module, used for giving a size table of the subtree structure t_q to be queried.

A subtree approximation table query module, used for obtaining the subtree approximation table corresponding to each subtree structure by adopting a subtree approximate query algorithm based on the subtree inverted index table and the size table.

A graph structure approximation table query module, used for obtaining the graph structure approximation table corresponding to each graph structure by adopting a graph structure approximate query algorithm based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure.

As an optional implementation manner, the three-layer inverted index building module of the present invention specifically includes:

A graph sequence inverted index table construction unit, used for decomposing each graph sequence into a graph structure sequence, taking each graph structure in the graph structure sequence as an index, and establishing the graph sequence inverted index tables corresponding to all graph sequences.

A graph structure inverted index table construction unit, used for decomposing each medical record graph structure into a medical record sub-tree sequence, taking each medical record sub-tree structure in the medical record sub-tree sequence as an index, and establishing the graph structure inverted index tables corresponding to all medical record graph structures.

As an optional implementation manner, the subtree approximation table query module of the present invention specifically comprises:

A subtree sorted list determining unit, used for accessing the subtree inverted index table to obtain the subtree sorted list corresponding to each node.

An approximate distance calculation unit, used for classifying the subtree sorted lists according to the size table, wherein for each subtree sorted list smaller than the size table, α is calculated using α＝2×|L_q|; for each subtree sorted list greater than or equal to the size table, α is calculated using α＝-|L_q|-(t(β)-2×τ); where α represents the approximate distance between each subtree structure and the subtree structure t_q to be queried, L_q represents the leaf node set of the subtree structure t_q to be queried, t(β) represents the number of common leaf labels, and τ represents the last seen size value in the size table.

As an optional implementation manner, the graph structure approximation table query module specifically comprises:

A graph structure sorted list determining unit, used for accessing the subtree approximation tables row by row, and merging and sorting the graph structure inverted index tables corresponding to the k1 approximate subtrees of the subtree structure to obtain the graph structure sorted list.

A subtree approximate distance evaluation sum determining unit, used for calculating the subtree approximate distance evaluation sum Γ using Γ＝Σ_{j=1}^{M} Θ_j, where M represents the total number of subtree sorted lists and Θ_j represents the subtree approximate distance at the current access location in the j-th subtree sorted list.

As an optional implementation manner, the graph sequence approximation table query module specifically comprises:

A graph sequence sorted list determining unit, used for accessing the graph structure approximation tables row by row, and merging and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain the graph sequence sorted list.

A graph structure approximate distance evaluation sum determining unit, used for calculating the graph structure approximate distance evaluation sum K using K＝Σ_{k=1}^{N} ω_k, where N represents the total number of graph structure sorted lists and ω_k represents the graph structure approximate distance at the current access location in the k-th graph structure sorted list.

In order to examine the experimental results more clearly and intuitively, the running time results are presented in two forms: a data table and a histogram. Table 5 lists the running time data of the CSM-KNN algorithm and the Feature-KNN algorithm on both data sets; graph (a) in FIG. 7 shows the running time comparison of the two algorithms on data set MIMIC-III, and graph (b) in FIG. 7 shows the running time comparison of the two algorithms on data set OH. From FIG. 7 it can be concluded that the medical record approximate search algorithm studied by the invention can efficiently support practical applications.

TABLE 5 comparison of the runtime (sec) results of CSM-KNN algorithm and Feature-KNN algorithm

In order to examine the experimental results more clearly and intuitively, the accuracy comparison results are presented in two forms: a data table and a histogram. Table 6 lists the average accuracy data of the CSM-KNN algorithm and the Feature-KNN algorithm on data set MIMIC-III; graph (a) in FIG. 8 shows the accuracy comparison of the two algorithms on data set MIMIC-III, and graph (b) in FIG. 8 shows the accuracy comparison of the two algorithms on data set OH.

Table 6 average accuracy comparison results of CSM-KNN algorithm and Feature-KNN algorithm

In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.

The principles and embodiments of the present invention have been described herein with reference to specific examples; the description of these examples is intended only to assist in understanding the method of the present invention and its core ideas. Meanwhile, those of ordinary skill in the art may, in light of the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the foregoing, the content of this description should not be construed as limiting the invention.

Claims

1. A medical chart sequence retrieval method based on sub-tree inverted indexes, which is characterized by comprising the following steps:

step S3: giving a size table of the subtree structure t_q to be queried;

2. The medical chart sequence retrieval method based on the sub-tree inverted index according to claim 1, wherein the step S1 specifically includes:

3. The medical chart sequence searching method based on the sub-tree inverted index according to claim 1, wherein the step S4 specifically includes:

step S42: classifying the subtree sorted lists according to the size table, wherein for each subtree sorted list smaller than the size table, α is calculated using α＝2×|L_q|; for each subtree sorted list greater than or equal to the size table, α is calculated using α＝-|L_q|-(t(β)-2×τ); where α represents the approximate distance between each subtree structure and the subtree structure t_q to be queried, L_q represents the leaf node set of the subtree structure t_q to be queried, t(β) represents the number of common leaf labels, and τ represents the last seen size value in the size table;

4. The medical chart sequence retrieval method based on the sub-tree inverted index according to claim 3, wherein the step S5 specifically comprises:

5. The medical chart sequence searching method based on the sub-tree inverted index according to claim 4, wherein the step S6 specifically includes:

6. A system for retrieving a sequence of medical records based on a sub-tree inverted index, the system comprising:

7. The system for retrieving a sequence of medical records based on a sub-tree inverted index as set forth in claim 6, wherein the three-layer inverted index construction module specifically includes:

8. The system for retrieving a sequence of medical records based on an inverted index of a subtree according to claim 6, wherein the subtree approximation table query module specifically comprises:

9. The system for retrieving a sequence of medical records based on sub-tree inverted indexes according to claim 8, wherein the graph structure approximation table query module specifically comprises:

a subtree approximate distance evaluation sum determining unit, used for calculating the subtree approximate distance evaluation sum Γ using Γ＝Σ_{j=1}^{M} Θ_j, where M represents the total number of subtree sorted lists and Θ_j represents the subtree approximate distance at the current access location in the j-th subtree sorted list;

10. The system for retrieving a sequence of medical records based on sub-tree inverted indexes according to claim 9, wherein the graph sequence approximation table query module specifically comprises: