CN113010746A

CN113010746A - Medical record sequence retrieval method and system based on subtree inverted index

Info

Publication number: CN113010746A
Application number: CN202110294328.9A
Authority: CN
Inventors: 王晓黎; 黄烨钒
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2021-03-19
Filing date: 2021-03-19
Publication date: 2021-06-22
Anticipated expiration: 2041-03-19
Also published as: CN113010746B

Abstract

The invention relates to a medical record sequence retrieval method and system based on a subtree inverted index. First, a three-layer inverted index of a medical record sequence database is constructed based on a subtree decomposition algorithm. Second, based on the subtree inverted index table and the size table, a subtree approximate query algorithm is used to obtain the subtree approximation table corresponding to each subtree structure. Then, based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure, a graph structure approximate query algorithm is used to obtain the graph structure approximation table corresponding to each graph structure. Finally, based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure, a graph sequence approximate query algorithm is used to obtain the graph sequence approximation table corresponding to each graph sequence. By combining the three-layer inverted index with the approximate query algorithms, the method and system establish relations among multimodal data and perform approximate case search on that basis, thereby improving search accuracy.

Description

Medical record sequence retrieval method and system based on subtree inverted index

Technical Field

The invention relates to the field of medical record sequence retrieval, in particular to a medical record sequence retrieval method and system based on subtree inverted index.

Background

With the rapid development of information technology, data in every industry has become increasingly rich and diverse, producing content-rich multimodal data such as text, pictures, audio and video.

Because multimodal data are diverse, complex and random, structured unified management is difficult to achieve. Moreover, the data are often related, and their true value can be realized only by mining these potential relations. Traditional database technologies usually process data of a single modality, and data of different modalities are represented by different complex data models, such as character strings, trees, graphs, high-dimensional data and dynamic sequences. These methods cannot represent the relevance between multimodal data and cannot meet users' comprehensive requirements for information retrieval. Some cross-media unified indexing techniques solve the cross-domain query problem for the portion of data with significant semantic relevance. However, these solutions apply only to highly correlated data such as social media data, and cannot process medical health data with fuzzy semantic relations. Because the data foundation is weak, the analysis results are often of little significance and lack practicality. Therefore, how to effectively and uniformly model and index multimodal data with fuzzy semantic associations is an important scientific problem to be solved by this research.

In addition, the structure and content of multimodal medical health data are not fixed but evolve over time. For example, an electronic medical record often contains multiple visit records of a patient, and the data structure and content generated by each visit are often not fixed; health information such as body temperature collected by medical mobile platforms varies even more as the user's physical state changes. Analyzing such dynamically changing attributes of the data is of great significance for both predicting patient condition and monitoring user health. Existing medical big data analysis methods cannot describe the dynamic attributes of the data, and often need complex machine learning algorithms to analyze and predict how the data evolve. Owing to the limitations of artificial intelligence in complex, variable and dynamic environments, the accuracy of the analysis results is often too low. Therefore, how to design a new dynamic model that accurately describes the evolution of medical health data over time is a key scientific problem to be solved by this research.

Disclosure of Invention

The invention aims to provide a medical record sequence retrieval method and system based on subtree inverted index, so as to improve the accuracy of case search.

In order to achieve the above object, the present invention provides a medical record graph sequence retrieval method based on subtree inverted index, which comprises:

step S1: constructing three-layer inverted indexes of a medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;

step S2: obtaining a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, the graph structures are decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes;

step S3: giving a size table of the subtree structure t_q to be queried;

step S4: based on the subtree inverted index table and the size table, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure;

step S5: based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate table corresponding to each graph structure is obtained by adopting a graph structure approximate query algorithm;

step S6: and obtaining a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.

Optionally, the step S1 specifically includes:

step S11: decomposing each medical record graph sequence into a medical record graph structure sequence, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences by taking each medical record graph structure in the medical record graph structure sequence as an index;

step S12: decomposing each medical record graph structure into a medical record sub-tree sequence, and establishing a graph structure inverted index table corresponding to all medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index;

step S13: and decomposing each medical record subtree structure into a medical record node sequence, and establishing a subtree inverted index table corresponding to all medical record subtree structures by taking each medical record node in the medical record node sequence as an index.

Optionally, the step S4 specifically includes:

step S41: accessing the subtree inverted index table to obtain a subtree sequencing table corresponding to each node;

step S42: classifying the subtree sorting tables according to the size table, wherein for subtree sorting tables smaller than the size table, α is computed as $\alpha = 2 \times |L_q| - (t(\beta) + \tau)$; for subtree sorting tables greater than or equal to the size table, α is computed as $\alpha = |L_q| - (t(\beta) - 2 \times \tau)$; wherein α represents the approximate distance between each subtree structure and the subtree structure t_q to be queried, L_q represents the subtree structure t_q to be queried, t(β) represents the number of common leaf labels, and τ represents the last seen size value in the size table;

step S43: accessing the current sub-tree structure in the sub-tree sequencing table, and judging whether alpha is larger than the sub-tree approximate distance maximum in the sub-tree approximate table; if alpha is larger than the sub-tree with the maximum approximate distance in the sub-tree approximate table, stopping subsequent access and outputting the sub-tree approximate table corresponding to each sub-tree structure; if alpha is smaller than or equal to the sub-tree approximate distance maximum in the sub-tree approximate table, adding the currently accessed sub-tree structure in the sub-tree ordering table into the sub-tree approximate table, and accessing the next sub-tree structure in the sub-tree ordering table until alpha is larger than the sub-tree approximate distance maximum in the sub-tree approximate table; each subtree approximation table comprises k1 subtree structures, and the subtree structures in the subtree approximation table are subsequently called approximation subtrees.

Optionally, the step S5 specifically includes:

step S51: accessing the subtree approximation table row by row, and combining and sorting the graph structure inverted index tables corresponding to k1 approximation subtrees of the subtree structure to obtain a graph structure sorting table;

step S52: calculating the subtree approximate distance evaluation sum γ using

$$\gamma = \sum_{j=1}^{M} \theta_j$$

wherein M represents the total number of subtree sorting tables, and θ_j represents the subtree approximate distance at the current access position in the j-th subtree sorting table;

step S53: accessing the current graph structure in the graph structure sorting table, and judging whether the evaluation sum of the approximate distances of the subtrees is greater than the maximum approximate distance of the graph structure in the graph structure approximate table; if the distance is larger than the maximum graph structure approximate distance in the graph structure approximate table, stopping subsequent access and outputting the graph structure approximate table corresponding to each graph structure; if the distance is smaller than or equal to the maximum graph structure approximate distance in the graph structure approximate table, each graph structure is added into the graph structure approximate table, and the next graph structure in the graph structure sorting table is accessed until the evaluation sum of the sub-tree approximate distances is larger than the maximum graph structure approximate distance in the graph structure approximate table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are subsequently referred to as approximate graph structures.
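The stopping rule of steps S51 to S53 can be illustrated with a minimal sketch in Python. It reads the steps as a threshold-style scan: the sorted lists are accessed in parallel, the evaluation sum γ is recomputed at each access depth, and the scan stops once γ exceeds the largest distance currently held in the approximation table. The function name `top_k_graph_structures`, the caller-supplied `exact_distance` function, the list layout, and the treatment of exhausted lists (they contribute nothing to γ) are assumptions made for illustration and are not specified in the text.

```python
import heapq

def top_k_graph_structures(sorted_lists, k, exact_distance):
    """Threshold-style scan over M lists of (graph_id, subtree_distance).

    Each list is assumed to be sorted by ascending subtree distance.
    exact_distance(graph_id) is a hypothetical callback that returns the
    graph structure approximate distance to the query.
    """
    best = []      # max-heap on distance, stored as (-distance, graph_id)
    seen = set()
    depth = 0
    max_len = max(len(lst) for lst in sorted_lists)
    while depth < max_len:
        # gamma: sum of theta_j at the current access position of each list
        gamma = sum(lst[depth][1] for lst in sorted_lists if depth < len(lst))
        # stop once gamma exceeds the largest distance in a full table
        if len(best) == k and gamma > -best[0][0]:
            break
        for lst in sorted_lists:
            if depth >= len(lst):
                continue
            graph_id = lst[depth][0]
            if graph_id in seen:
                continue
            seen.add(graph_id)
            dist = exact_distance(graph_id)
            if len(best) < k:
                heapq.heappush(best, (-dist, graph_id))
            elif dist < -best[0][0]:
                heapq.heapreplace(best, (-dist, graph_id))
        depth += 1
    return sorted((-neg, gid) for neg, gid in best)

# Illustrative data only: two sorted lists and made-up exact distances.
lists = [
    [("g1", 0), ("g2", 1), ("g3", 3)],
    [("g1", 0), ("g3", 1), ("g2", 2)],
]
distances = {"g1": 0, "g2": 3, "g3": 4}
top1 = top_k_graph_structures(lists, 1, distances.get)
top2 = top_k_graph_structures(lists, 2, distances.get)
```

The same scan shape would apply one level up in step S6, with graph sequences in place of graph structures and the graph structure approximate distances in place of θ_j.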

Optionally, the step S6 specifically includes:

step S61: accessing the graph structure approximate table line by line, and combining and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain a graph sequence sorting table;

step S62: calculating the graph structure approximate distance evaluation sum K using

$$K = \sum_{k=1}^{N} \omega_k$$

wherein N represents the total number of graph structure sorting tables, and ω_k represents the graph structure approximate distance at the current access position in the k-th graph structure sorting table;

step S63: accessing the current graph sequence in the graph sequence sorting table, and judging whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if it is greater, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if it is less than or equal to the maximum graph sequence alignment distance, adding the currently accessed graph sequence to the graph sequence approximation table and accessing the next graph sequence in the graph sequence sorting table, until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table comprises k3 graph sequences, and the graph sequences in the graph sequence approximation table are subsequently called approximate graph sequences.

The invention also provides a medical record graph sequence retrieval system based on the subtree inverted index, which comprises:

the three-layer inverted index construction module is used for constructing three-layer inverted indexes of the medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;

an obtaining module, configured to obtain a graph sequence to be queried, where the graph sequence includes a plurality of graph structures, the graph structure is decomposed into a sub-tree sequence, the sub-tree sequence includes a plurality of sub-tree structures, and then each sub-tree structure is decomposed into a node sequence, where the node sequence includes a plurality of nodes;

a giving module, configured to give a size table of the subtree structure t_q to be queried;

the sub-tree approximate table query module is used for obtaining a sub-tree approximate table corresponding to each sub-tree structure by adopting a sub-tree approximate query algorithm based on the sub-tree inverted index table and the size table;

the graph structure approximation table query module is used for obtaining a graph structure approximation table corresponding to each graph structure by adopting a graph structure approximation query algorithm based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure;

and the graph sequence approximation table query module is used for acquiring a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.

Optionally, the three-layer inverted index building module specifically includes:

the graph sequence inverted index table construction unit is used for decomposing each medical record graph sequence into a medical record graph structure sequence, taking each medical record graph structure in the medical record graph structure sequence as an index, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences;

the graph structure inverted index table construction unit is used for decomposing each medical record graph structure into a medical record sub-tree sequence, and establishing a graph structure inverted index table corresponding to all the medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index;

and the subtree inverted index table construction unit is used for decomposing each medical record subtree structure into a medical record node sequence, and establishing subtree inverted index tables corresponding to all medical record subtree structures by taking each medical record node in the medical record node sequence as an index.

Optionally, the sub-tree approximation table querying module specifically includes:

the sub-tree sequencing table determining unit is used for accessing the sub-tree inverted index table to obtain a sub-tree sequencing table corresponding to each node;

an approximate distance calculation unit, configured to classify the subtree sorting tables according to the size table, wherein for subtree sorting tables smaller than the size table, α is computed as $\alpha = 2 \times |L_q| - (t(\beta) + \tau)$; for subtree sorting tables greater than or equal to the size table, α is computed as $\alpha = |L_q| - (t(\beta) - 2 \times \tau)$; wherein α represents the approximate distance between each subtree structure and the subtree structure t_q to be queried, L_q represents the subtree structure t_q to be queried, t(β) represents the number of common leaf labels, and τ represents the last seen size value in the size table;

the first judging unit is used for accessing the current sub-tree structure in the sub-tree sequencing table and judging whether alpha is larger than the sub-tree approximate distance maximum in the sub-tree approximate table; if alpha is larger than the sub-tree with the maximum approximate distance in the sub-tree approximate table, stopping subsequent access and outputting the sub-tree approximate table corresponding to each sub-tree structure; if alpha is smaller than or equal to the sub-tree approximate distance maximum in the sub-tree approximate table, adding the currently accessed sub-tree structure in the sub-tree ordering table into the sub-tree approximate table, and accessing the next sub-tree structure in the sub-tree ordering table until alpha is larger than the sub-tree approximate distance maximum in the sub-tree approximate table; each subtree approximation table comprises k1 subtree structures, and the subtree structures in the subtree approximation table are subsequently called approximation subtrees.

Optionally, the graph structure approximation table query module specifically includes:

the graph structure ordering table determining unit is used for accessing the subtree approximation table row by row and combining and ordering graph structure inverted index tables corresponding to k1 approximation subtrees of the subtree structure to obtain a graph structure ordering table;

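a subtree approximate distance evaluation sum determination unit, configured to calculate the subtree approximate distance evaluation sum γ using $\gamma = \sum_{j=1}^{M} \theta_j$, wherein M represents the total number of subtree sorting tables, and θ_j represents the subtree approximate distance at the current access position in the j-th subtree sorting table;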

the second judging unit is used for accessing the current graph structure in the graph structure sorting table and judging whether the evaluation sum of the approximate distances of the subtrees is greater than the maximum approximate distance of the graph structure in the graph structure approximate table; if the distance is larger than the maximum graph structure approximate distance in the graph structure approximate table, stopping subsequent access and outputting the graph structure approximate table corresponding to each graph structure; if the distance is smaller than or equal to the maximum graph structure approximate distance in the graph structure approximate table, each graph structure is added into the graph structure approximate table, and the next graph structure in the graph structure sorting table is accessed until the evaluation sum of the sub-tree approximate distances is larger than the maximum graph structure approximate distance in the graph structure approximate table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are subsequently referred to as approximate graph structures.

Optionally, the graph sequence approximation table query module specifically includes:

the graph sequence ordering table determining unit is used for accessing the graph structure approximate table line by line and combining and ordering the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain a graph sequence ordering table;

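a graph structure approximate distance evaluation sum determination unit, configured to calculate the graph structure approximate distance evaluation sum K using $K = \sum_{k=1}^{N} \omega_k$, wherein N represents the total number of graph structure sorting tables, and ω_k represents the graph structure approximate distance at the current access position in the k-th graph structure sorting table;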

a third judging unit, configured to access the current graph sequence in the graph sequence sorting table, and judge whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if it is greater, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if it is less than or equal to the maximum graph sequence alignment distance, adding the currently accessed graph sequence to the graph sequence approximation table and accessing the next graph sequence in the graph sequence sorting table, until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table comprises k3 graph sequences, and the graph sequences in the graph sequence approximation table are subsequently called approximate graph sequences.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention relates to a medical record sequence retrieval method and system based on a subtree inverted index. First, a three-layer inverted index of a medical record sequence database is constructed based on a subtree decomposition algorithm. Second, based on the subtree inverted index table and the size table, a subtree approximate query algorithm is used to obtain the subtree approximation table corresponding to each subtree structure. Then, based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure, a graph structure approximate query algorithm is used to obtain the graph structure approximation table corresponding to each graph structure. Finally, based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure, a graph sequence approximate query algorithm is used to obtain the graph sequence approximation table corresponding to each graph sequence. By combining the three-layer inverted index with the approximate query algorithms, the method and system establish relations among multimodal data and perform approximate case search on that basis, thereby improving the accuracy of case search.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is an example diagram of medical record graph sequences according to an embodiment of the invention;

FIG. 2 is an example diagram of subtrees according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an exemplary calculation of subtree approximate distances according to an embodiment of the present invention;

FIG. 4 is an exemplary diagram illustrating the calculation of graph structure approximate distances according to an embodiment of the present invention;

FIG. 5 is a flowchart of a full example of an embodiment of the present invention;

FIG. 6 is a flowchart of a medical record graph sequence retrieval method based on inverted indexes of subtrees according to an embodiment of the present invention;

FIG. 7 is a graph of the results of the run time of the approximate medical record search algorithm according to the embodiment of the present invention;

FIG. 8 is a chart showing the result of the accuracy of the medical record approximate search algorithm according to the embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

As shown in fig. 5-6, the present invention provides a medical record graph sequence searching method based on subtree inverted index, including:

step S3: giving a size table of the subtree structure t_q to be queried;

The individual steps are discussed in detail below:

step S1: constructing three-layer inverted indexes of a medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table. The medical record graph sequence database comprises a plurality of medical record graph sequences;

step S11: decomposing each medical record graph sequence into a medical record graph structure sequence, taking each medical record graph structure in the medical record graph structure sequence as an index, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences, wherein the specific implementation process is as follows:

Given the medical record graph sequence of patient i:

$$G^i = (g_1, g_2, \ldots, g_{|G^i|})$$

where g_j is the medical record graph structure of patient i at the j-th admission, and |Gⁱ| represents the number of medical record graph structures in the medical record graph sequence Gⁱ (i.e., the number of admissions of patient i). All graph structures in the sequence are ordered by the patient's actual admission time (e.g., the admission time of g₁ is earlier than that of g₂). As this representation shows, a medical record graph sequence consists of the medical record graph structures corresponding to a patient's series of admission records, and each medical record graph structure represents the medical record of one admission. The medical record graph sequence can therefore be decomposed directly into a sequence of graph structures, and each medical record graph structure is used as an index entry under which an inverted list stores information about the related graph sequences. Each entry in the inverted list stores the identifier of a graph sequence and the frequency with which the corresponding graph structure appears in that graph sequence.

As shown in FIG. 1, a medical record graph sequence database containing two patients is given, denoted {G¹, G²}. Medical record graph sequence G¹ comprises two admission records, represented as medical record graph structures g₁ and g₂; medical record graph sequence G² also comprises two admission records, represented as medical record graph structures g₁ and g₃. Table 1 shows the graph sequence inverted index constructed from this database, with the medical record graph structures g₁, g₂ and g₃ as index entries. The inverted list of index g₁ contains two elements, {G¹, 1} and {G², 1}, indicating that G¹ contains one medical record graph structure g₁ and that G² contains one medical record graph structure g₁; the inverted list of index g₂ contains one element, {G¹, 1}, indicating that G¹ contains one medical record graph structure g₂; the inverted list of index g₃ contains one element, {G², 1}, indicating that G² contains one medical record graph structure g₃.

Table 1 Graph sequence inverted index
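The construction of the graph sequence inverted index in step S11 can be sketched in Python using the two-patient example of FIG. 1. The function name `build_graph_sequence_index` and the string identifiers ("G1", "g1", and so on) are illustrative assumptions; graph structures are treated as already identified by a label, so no graph matching is shown.

```python
from collections import Counter, defaultdict

def build_graph_sequence_index(database):
    """Map each graph structure id to a list of (sequence id, frequency)."""
    index = defaultdict(list)
    for seq_id, graph_ids in database.items():
        # Count how often each graph structure occurs in this sequence.
        for graph_id, freq in Counter(graph_ids).items():
            index[graph_id].append((seq_id, freq))
    return dict(index)

# The example database: G1 = (g1, g2), G2 = (g1, g3).
database = {"G1": ["g1", "g2"], "G2": ["g1", "g3"]}
sequence_index = build_graph_sequence_index(database)
```

With this input, the entry for g1 lists both sequences with frequency 1, while g2 and g3 each list a single sequence, matching the inverted lists described for the example.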

Step S12: decomposing each medical record graph structure into a medical record sub-tree sequence, establishing a graph structure inverted index table corresponding to all medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index, and specifically implementing the following processes:

Given a medical record graph structure g_j in the medical record graph sequence of patient i, g_j is decomposed and represented as a medical record subtree sequence:

$$g_j = (t_1, t_2, \ldots, t_{|g_j|})$$

where t_m represents a medical record subtree structure, and |g_j| represents the number of medical record nodes in the medical record graph structure g_j (i.e., the number of subtrees obtained by decomposing g_j).

Taking one medical record graph structure as an example, the decomposition traverses every medical record node in the medical record node set and finds the nodes whose out-degree is not 0. Each such node, together with the nodes its outgoing edges point to, forms a medical record subtree structure, where the out-degree is the number of edges pointing from the node to other medical record nodes. An inverted index is then built over all graph structures, with subtrees as index entries and an inverted list under each. Each entry in the inverted list contains the identifier of a graph structure and the frequency with which the corresponding subtree appears in that graph structure, and all lists are sorted in increasing order of graph structure size.

All the medical record graph structures g₁, g₂ and g₃ in FIG. 1 are traversed in turn. First, each medical record node in the node list {n₁, n₂, n₃, n₄} of medical record graph structure g₁ is traversed. Starting with n₁: n₁ has two edges {e₁, e₂} pointing to other medical record nodes, and the medical record nodes related to e₁ and e₂ are extracted to form the medical record subtree structure t₁ in FIG. 2. Next, n₂ is checked; since n₂ has no edges pointing to other medical record nodes, n₂ is skipped. Likewise, n₃ has no edges pointing to other medical record nodes and is skipped. Finally, n₄ is checked; n₄ has an edge e₃ pointing to medical record node n₂, and the medical record nodes related to e₃ are extracted to form the medical record subtree structure t₂ in FIG. 2.

Then each medical record node in the node list {n₁, n₂, n₃, n₅} of g₂ is traversed. Starting with n₁: n₁ has two edges {e₁, e₂} pointing to other medical record nodes, and the medical record nodes related to e₁ and e₂ form a subtree that is the same as t₁ in FIG. 2, so it is not added again. Next, n₂ is checked; since n₂ has no edges pointing to other medical record nodes, n₂ is skipped. Likewise, n₃ has no edges pointing to other medical record nodes and is skipped. Finally, n₅ is checked; n₅ has two edges e₃ pointing to medical record node n₂, and the medical record nodes related to e₃ are extracted to form the medical record subtree structure t₃ in FIG. 2.

Finally, each medical record node in the node list {n₁, n₂, n₃, n₅} of medical record graph structure g₃ is traversed. Starting with n₁: n₁ has two edges {e₁, e₂} pointing to other medical record nodes, and the medical record nodes related to e₁ and e₂ form a subtree that is the same as t₁ in FIG. 2, so it is not added again. Next, n₂ is checked; since n₂ has no edges pointing to other medical record nodes, n₂ is skipped. Likewise, n₃ has no edges pointing to other medical record nodes and is skipped. Finally, n₅ is checked; n₅ has one edge e₃ pointing to medical record node n₂, and the medical record nodes related to e₃ are extracted to form the medical record subtree structure t₄ in FIG. 2.

From all the medical record graph structures g₁, g₂ and g₃ of the medical record graph sequence in FIG. 1, the four medical record subtree structures t₁, t₂, t₃ and t₄ in FIG. 2 are obtained by segmentation. Table 2 shows the graph structure inverted index constructed on the basis of the medical record subtree structures, where freq is the frequency of the corresponding subtree.
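The traversal above can be sketched in Python. This is a minimal illustration and not the patented implementation: the function names `decompose` and `build_graph_structure_index`, the edge-list encoding of g₁ to g₃, and the use of node names in place of node labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical edge lists (source, target) for the three graph structures
# of the walkthrough; n5 points to n2 twice in g2 and once in g3.
GRAPHS = {
    "g1": [("n1", "n2"), ("n1", "n3"), ("n4", "n2")],
    "g2": [("n1", "n2"), ("n1", "n3"), ("n5", "n2"), ("n5", "n2")],
    "g3": [("n1", "n2"), ("n1", "n3"), ("n5", "n2")],
}


def decompose(edges):
    """Return one subtree (root, sorted leaf tuple) per node that has outgoing edges."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    # Nodes without outgoing edges never become keys, so they are skipped.
    return [(root, tuple(sorted(leaves))) for root, leaves in children.items()]


def build_graph_structure_index(graphs):
    """Map each distinct subtree to a posting list of (graph id, freq)."""
    subtree_ids = {}            # subtree -> identifier t1, t2, ...
    index = defaultdict(list)   # identifier -> [(graph id, freq), ...]
    for gid, edges in graphs.items():
        for subtree, freq in Counter(decompose(edges)).items():
            # An already known subtree keeps its identifier and is not added again.
            tid = subtree_ids.setdefault(subtree, f"t{len(subtree_ids) + 1}")
            index[tid].append((gid, freq))
    return subtree_ids, dict(index)


subtree_ids, graph_index = build_graph_structure_index(GRAPHS)
for tid, postings in graph_index.items():
    print(tid, postings)
```

Under these assumptions the sketch yields four distinct subtrees, with t₁ shared by all three graph structures, which matches the segmentation result described above.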

Table 2 Graph structure inverted index

This inverted index of the subtree consists of two main parts: an index for all of the different medical record nodes, and a posting list under each medical record node. Each entry in the posting list contains an identifier of a subtree and a frequency with which the corresponding medical record node appears in the subtree. All lists are sorted in order of increasing subtree size.

For the medical record subtree structures t₁, t₂, t₃ and t₄ in FIG. 2, Table 3 shows the subtree inverted index constructed on the basis of medical record nodes, i.e., with the medical record nodes n₁, …, n₁₀ as the index. To index the structure, each medical record subtree structure is further decomposed into cells (i.e., vertices and edges), which are indexed in inverted lists. The index again contains two components: a label index arranged in ascending alphabetical order of the medical record node label names (e.g., if n₁ is diabetes and n₂ is influenza, then n₂ is placed before n₁), and, under each label, an inverted list that records the subtree identifiers and the frequency of the corresponding label. The entries in each list are first grouped by leaf size and then sorted within each group by decreasing frequency. A leaf node is a medical record node that has no edges pointing to other medical record nodes.

For example, the first column of Table 3 is the identifier of each medical record subtree structure {t₁, t₂, t₃, t₄}, and the second column is the number of leaf nodes of that subtree structure: in FIG. 2, t₁ has 2 leaf nodes n₂, n₃; t₂ has 1 leaf node n₂; t₃ has 2 leaf nodes n₂, n₂; and t₄ has 1 leaf node n₂. In Table 3, column 1 is the subtree identifier and column 2 is the number of occurrences of medical record node n₂ in the subtree; since n₂ occurs most frequently in subtree t₃, t₃ is located in the first row of the list for n₂.
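The ordering rules of the subtree inverted index can be illustrated with a short Python sketch. It is only an assumption-laden example: the name `build_subtree_index`, the choice to index leaf nodes only, the ascending order of the leaf-size groups, and the tie-break on subtree identifier are not taken from the text.

```python
from collections import Counter, defaultdict

# Subtrees of FIG. 2 as described in the text: identifier -> (root, leaf nodes).
SUBTREES = {
    "t1": ("n1", ["n2", "n3"]),
    "t2": ("n4", ["n2"]),
    "t3": ("n5", ["n2", "n2"]),
    "t4": ("n5", ["n2"]),
}


def build_subtree_index(subtrees):
    """Map each leaf node to a posting list of (subtree id, leaf size, frequency)."""
    postings = defaultdict(list)
    for tid, (_root, leaves) in subtrees.items():
        for node, freq in Counter(leaves).items():
            postings[node].append((tid, len(leaves), freq))
    # Group by leaf size (ascending, assumed), then sort each group by
    # decreasing frequency; the subtree id only breaks ties deterministically.
    for plist in postings.values():
        plist.sort(key=lambda entry: (entry[1], -entry[2], entry[0]))
    # Index keys in ascending order of their names.
    return dict(sorted(postings.items()))


subtree_index = build_subtree_index(SUBTREES)
for node, plist in subtree_index.items():
    print(node, plist)
```

In this sketch t₃ comes first within the group of leaf size 2 for node n₂, because n₂ occurs twice in t₃ and once in t₁.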

TABLE 3 Subtree inverted index

Step S2: Obtain the graph sequence to be queried. The graph sequence comprises a plurality of graph structures; each graph structure is decomposed into a subtree sequence comprising a plurality of subtree structures; and each subtree structure is decomposed into a node sequence comprising a plurality of nodes.

The subtree structure is represented by a triple (r, L, l), where r is the root node of the subtree structure, L is the set of leaf nodes, and l is a labeling function. For example, in FIG. 3 the root node of subtree structure t₃ is n₅, whose label is diabetes, and the root node of subtree structure t₄ is n₅, whose label is also diabetes.

Calculating subtree approximate distances among all subtree structures, wherein the specific formula is as follows:

λ(t₁,t₂)＝T(r₁,r₂)+d(L₁,L₂)

d(L₁,L₂)＝||L₁|-|L₂||+M(L₁,L₂)

wherein λ(t₁,t₂) represents the subtree approximate distance between subtree structure t₁ and subtree structure t₂; T(r₁,r₂) depends on whether l(r₁) and l(r₂) are equal: if l(r₁)＝l(r₂), i.e., the labels of the two root nodes coincide, then T(r₁,r₂)＝0, otherwise T(r₁,r₂)＝1; l denotes the labeling function; r₁ represents the root node of subtree structure t₁ and r₂ the root node of subtree structure t₂; L₁ represents the set of leaf nodes of subtree structure t₁ and L₂ the set of leaf nodes of subtree structure t₂; d(L₁,L₂) represents the distance between the two sets L₁ and L₂; and M(L₁,L₂) is computed from the set of labels corresponding to the leaf nodes of subtree t₁ and the set of labels corresponding to the leaf nodes of subtree t₂.

As shown in FIG. 3, the subtree approximate distance between subtree structure t₃ and subtree structure t₄ is computed as follows. Clearly l(r₃)＝l(r₄), since both root nodes are n₅, so T(r₃,r₄)＝0. The leaf node list of subtree structure t₃ is L₃＝{n₂,n₂}, so the list sizes are |L₃|＝2 and |L₄|＝1.

Therefore, λ(t₃,t₄)＝0+|2-1|+(2-1)＝2.
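The worked example can be checked with a small Python sketch. The term M(L₁,L₂) is not spelled out in the text above, so the sketch assumes M(L₁,L₂) = max(|L₁|,|L₂|) minus the number of leaf labels the two subtrees have in common (counted with multiplicity); this assumption was chosen because it reproduces the 2-1 term of the example. The function names and the tuple encoding of a subtree are likewise hypothetical.

```python
from collections import Counter


def leaf_distance(leaves_1, leaves_2):
    """d(L1, L2) = ||L1| - |L2|| + M(L1, L2), with M as assumed in the lead-in."""
    common = sum((Counter(leaves_1) & Counter(leaves_2)).values())
    m_term = max(len(leaves_1), len(leaves_2)) - common
    return abs(len(leaves_1) - len(leaves_2)) + m_term


def subtree_distance(subtree_1, subtree_2):
    """lambda(t1, t2) = T(r1, r2) + d(L1, L2) for subtrees given as (root label, leaf labels)."""
    (root_1, leaves_1), (root_2, leaves_2) = subtree_1, subtree_2
    root_term = 0 if root_1 == root_2 else 1
    return root_term + leaf_distance(leaves_1, leaves_2)


# t3 and t4 share the root label (diabetes); the leaves are given by node name.
t3 = ("diabetes", ["n2", "n2"])
t4 = ("diabetes", ["n2"])
print(subtree_distance(t3, t4))
```

With these inputs the three terms are 0, |2-1| and 2-1, as in the example.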

Step S3: Give the size table of the subtree structure t_q to be queried.

TABLE 4 size table

Step S4: based on the inverted index table and the size table of the subtree, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure, and the method specifically comprises the following steps:

Step S42: Classify the subtree sorting tables according to the size table. For the subtree sorting tables smaller than the size table, α is computed using α＝2×|L_q|-(t(β)+τ); for the subtree sorting tables greater than or equal to the size table, α is computed using α＝2×τ-|L_q|-t(β); wherein α represents the approximate distance between each subtree structure and the subtree structure t_q to be queried, |L_q| represents the number of leaf nodes of the subtree structure t_q to be queried, t(β) represents the number of common leaf labels, and τ represents the last size value seen in the size table.
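The two cases of step S42 can be sketched as follows. This is an illustration under stated assumptions: the second case is written as 2×τ-|L_q|-t(β), and `direct_leaf_distance` assumes that M(L₁,L₂) equals the larger leaf count minus the number of common leaf labels, so that both cases can be compared with d(L₁,L₂)＝||L₁|-|L₂||+M(L₁,L₂). The function and parameter names are hypothetical.

```python
def alpha(lq_size, common, tau):
    """Approximate leaf distance between the query subtree and a candidate.

    lq_size: number of leaf nodes of the query subtree, |L_q|
    common:  number of common leaf labels, t(beta)
    tau:     leaf size of the candidate, taken from the size table
    """
    if tau < lq_size:
        return 2 * lq_size - (common + tau)
    return 2 * tau - lq_size - common


def direct_leaf_distance(lq_size, common, tau):
    """||L_q| - tau| + (max(|L_q|, tau) - common), using the assumed form of M."""
    return abs(lq_size - tau) + max(lq_size, tau) - common


# Query t3 (2 leaves) against t4 (1 leaf, 1 common label), and the reverse.
print(alpha(2, 1, 1), alpha(1, 1, 2))
```

Under these assumptions both cases are simply the leaf distance with the absolute value and the maximum resolved by the size comparison, which is why the size table can select the formula without inspecting the candidate's leaves.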

Let t_q have m frequencies {f₁, f₂, ……, f_m}. The specific formula for calculating the number of common leaf labels is as follows:

wherein β_j indicates whether subtree structure t_i is arranged in the jth subtree sorting table of the subtree structure t_q to be queried: if subtree structure t_i does not appear in that subtree sorting table, then β_j＝0; if subtree structure t_i appears in that subtree sorting table, then β_j＝1; and f_j is the jth frequency of the subtree structure t_q to be queried.

By means of the subtree approximate query algorithm, the invention obtains the subtree structures that are highly similar to the subtree to be queried, i.e., the subtree approximation table.

The specific formula for calculating the approximate distance of the subtree is as follows:

λ(t_q,t_i)＝T(r₁,r₂)+d(L₁,L₂)

d(L₁,L₂)＝||L₁|-|L₂||+M(L₁,L₂)

wherein λ(t_q,t_i) represents the subtree approximate distance between the subtree structure t_q to be queried and subtree structure t_i; T(r₁,r₂) depends on whether l(r₁) and l(r₂) are equal: if l(r₁)＝l(r₂), i.e., the labels of the two root nodes coincide, then T(r₁,r₂)＝0, otherwise T(r₁,r₂)＝1; l denotes the labeling function; r₁ represents the root node of the subtree structure t_q to be queried and r₂ the root node of subtree structure t_i; L₁ represents the set of leaf nodes of the subtree structure t_q to be queried and L₂ the set of leaf nodes of subtree structure t_i; d(L₁,L₂) represents the distance between the two sets L₁ and L₂; and M(L₁,L₂) is computed from the set of labels corresponding to the leaf nodes of the subtree structure t_q to be queried and the set of labels corresponding to the leaf nodes of subtree structure t_i.

In FIG. 4, the matrix on the left is the subtree distance matrix between the subtree sequences T(g₁) and T(g₂); the gray cells represent the best match between subtree sequence T(g₁) and subtree sequence T(g₂), i.e., μ(g₁,g₂)＝0+0+9＝9. For clarity, g₁ and g₂ are represented on the right by their subtree sequences, and the best match is marked with solid arrows.

Step S5: based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate query algorithm is adopted to obtain a graph structure approximate table corresponding to each graph structure, and the method specifically comprises the following steps:

Step S52: Calculate the subtree approximate distance evaluation sum γ, wherein M represents the total number of subtree sorting tables, and θ_j indicates the subtree approximate distance at the current access position in the jth subtree sorting table.

The specific formula for calculating the approximate distance of the graph structure is as follows:

wherein P represents a bijection T(g_q)→T(g_i); subtree structure t_i belongs to the subtree sequence T(g_i); λ(t_i,P(t_i)) is the approximate distance between the two subtree structures t_i and P(t_i); and P(t_i) is the subtree structure in g_q aligned with t_i.
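The definitions above describe a minimum-cost bijection between two subtree sequences. A brute-force Python sketch is given below, assuming the graph structure approximate distance is the smallest total of λ over all bijections and that both subtree sequences have the same length. The cost matrix is hypothetical (it is not the matrix of FIG. 4) and was chosen only so that the best match sums to 0+0+9; the name `mapping_distance` is also an assumption.

```python
from itertools import permutations


def mapping_distance(cost):
    """Return (minimum total cost, column assigned to each row) for a square matrix.

    cost[i][j] is the subtree approximate distance between the ith subtree of
    one graph structure and the jth subtree of the other.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[row][perm[row]] for row in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm


# Hypothetical 3 x 3 subtree distance matrix.
COST = [
    [0, 4, 7],
    [5, 0, 8],
    [6, 3, 9],
]
print(mapping_distance(COST))
```

Enumerating all bijections grows factorially with the number of subtrees, so this form is only suitable for very small sequences; an assignment solver such as the Hungarian algorithm is the usual substitute for larger inputs.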

Calculate the lower bound κ(g_q,g_i) of the graph structure approximate distance μ(g_q,g_i), wherein the following quantities are used: the distance of the first subtree structure of graph structure g_i found in the jth subtree sorting table; S_j, the local minimum subtree approximate distance of graph structure g_i under the jth subtree sorting table of the graph structure g_q to be queried; the set containing all subtree approximate distances of graph structure g_i under the jth list; and e_x, a subtree approximate distance of graph structure g_i.

If κ(g_q,g_i) is greater than the current top-k values, g_i can be safely filtered out.

Step S6: based on the inverted index table of the graph sequences and the graph structure approximation table corresponding to each graph structure, a graph sequence approximation query algorithm is adopted to obtain a graph sequence approximation table corresponding to each graph sequence, and the method specifically comprises the following steps:

Step S62: Calculate the graph structure approximate distance evaluation sum K, wherein N represents the total number of graph structure sorting tables, and ω_k represents the graph structure approximate distance at the current access position in the kth graph structure sorting table.

Dynamic time warping is used to measure the similarity between two graph sequences, expressed as the graph alignment distance. The graph alignment distance is defined as follows: given two graph sequences G¹ and G², their sets of graph structures, and a bijection P: G¹→G², the alignment distance between G¹ and G² is denoted ω(G¹,G²). The specific formula for calculating the graph sequence alignment distance is as follows:

wherein ω(G^q,Gⁱ) represents the alignment distance between the graph sequence G^q to be queried and graph sequence Gⁱ; P represents the bijection G^q→Gⁱ; and each term is the approximate distance between a graph structure obtained by segmenting graph sequence Gⁱ and the graph structure in G^q aligned with it.

Calculating ω(G¹,G²) is equivalent to solving a sequence alignment problem, which is usually solved with a dynamic time warping algorithm whose goal is to find the minimum-weight match in a given cost matrix.
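For orientation, the textbook dynamic time warping recurrence is sketched below in Python. It is a generic illustration, not the specific procedure of the invention: the text does not state which variant is used, the name `dtw_distance` is hypothetical, and the absolute difference between numbers merely stands in for the graph structure approximate distance between two graph structures.

```python
def dtw_distance(seq_a, seq_b, dist):
    """Classic dynamic time warping cost between two sequences.

    dist(a, b) is the element-level distance; in the setting above it would be
    the graph structure approximate distance between two graph structures.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # table[i][j] = cheapest alignment of the first i and first j elements.
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = step + min(
                table[i - 1][j],      # seq_a advances alone
                table[i][j - 1],      # seq_b advances alone
                table[i - 1][j - 1],  # both advance
            )
    return table[n][m]


def absolute_difference(a, b):
    """Stand-in element distance for the example."""
    return abs(a - b)


print(dtw_distance([1, 2, 3], [1, 2, 2, 3], absolute_difference))
```

The table is filled in O(n·m) evaluations of the element distance, so in practice the element distances would be taken from the graph structure approximation tables rather than recomputed.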

Calculate the lower bound κ(G^q,Gⁱ) of the graph sequence alignment distance ω(G^q,Gⁱ) from the distance of the first graph of graph sequence Gⁱ found in the jth sorting table. Obviously, if κ(G^q,Gⁱ) is greater than the current top-k values, the graph sequence Gⁱ can be safely filtered out.
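The filtering rule can be illustrated with a generic top-k loop in Python. This is a sketch of lower-bound pruning in general, not the query algorithm of the invention: the name `top_k_with_filter`, the callables `lower_bound` and `exact_distance`, and the example values are all hypothetical, and the lower bound is assumed never to exceed the exact distance.

```python
import heapq


def top_k_with_filter(candidates, k, lower_bound, exact_distance):
    """Return the k nearest candidates as sorted (distance, id) pairs.

    A candidate is skipped without computing its exact distance when its lower
    bound already exceeds the kth best distance found so far.
    """
    if k <= 0:
        return []
    heap = []  # max-heap on distance, stored as (-distance, id)
    for cid in candidates:
        if len(heap) == k and lower_bound(cid) > -heap[0][0]:
            continue  # cannot enter the top k, so it is filtered out
        distance = exact_distance(cid)
        if len(heap) < k:
            heapq.heappush(heap, (-distance, cid))
        elif distance < -heap[0][0]:
            heapq.heapreplace(heap, (-distance, cid))
    return sorted((-neg, cid) for neg, cid in heap)


# Hypothetical exact distances and lower bounds for four graph sequences.
EXACT = {"a": 3, "b": 1, "c": 5, "d": 2}
LOWER = {"a": 2, "b": 1, "c": 4, "d": 1}
print(top_k_with_filter(["a", "b", "c", "d"], 2, LOWER.get, EXACT.get))
```

In this example candidate "c" is filtered out by its lower bound alone, so its exact alignment distance is never evaluated.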

The three-layer inverted index construction module is used for constructing the three-layer inverted index of the medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table.

The system comprises an acquisition module and a query module, wherein the acquisition module is used for acquiring a graph sequence to be queried, the graph sequence comprises a plurality of graph structures, the graph structures are decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes.

The giving module is used for giving the size table of the subtree structure t_q to be queried.

The subtree approximation table query module is used for obtaining a subtree approximation table corresponding to each subtree structure by adopting a subtree approximate query algorithm based on the subtree inverted index table and the size table.

The graph structure approximation table query module is used for obtaining a graph structure approximation table corresponding to each graph structure by adopting a graph structure approximate query algorithm based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure.

As an optional implementation manner, the three-layer inverted index building module of the present invention specifically includes:

The graph sequence inverted index table construction unit is used for decomposing each medical record graph sequence into a medical record graph structure sequence, and establishing the graph sequence inverted index table corresponding to all the medical record graph sequences by taking each medical record graph structure in the medical record graph structure sequence as an index.

The graph structure inverted index table construction unit is used for decomposing each medical record graph structure into a medical record subtree sequence, and establishing the graph structure inverted index table corresponding to all the medical record graph structures by taking each medical record subtree structure in the medical record subtree sequence as an index.

As an optional implementation manner, the sub-tree approximation table query module of the present invention specifically includes:

The subtree sorting table determining unit is used for accessing the subtree inverted index table to obtain the subtree sorting table corresponding to each node.

The approximate distance calculation unit is used for classifying the subtree sorting tables according to the size table. For the subtree sorting tables smaller than the size table, α is computed using α＝2×|L_q|-(t(β)+τ); for the subtree sorting tables greater than or equal to the size table, α is computed using α＝2×τ-|L_q|-t(β); wherein α represents the approximate distance between each subtree structure and the subtree structure t_q to be queried, |L_q| represents the number of leaf nodes of the subtree structure t_q to be queried, t(β) represents the number of common leaf labels, and τ represents the last size value seen in the size table.

As an optional implementation manner, the graph structure approximation table query module of the present invention specifically includes:

The graph structure sorting table determining unit is used for accessing the subtree approximation table row by row, and merging and sorting the graph structure inverted index tables corresponding to the k1 approximate subtrees of the subtree structure to obtain the graph structure sorting table.

The subtree approximate distance evaluation sum determination unit is used for calculating the subtree approximate distance evaluation sum γ.

As an optional implementation manner, the graph sequence approximation table query module of the present invention specifically includes:

The graph sequence sorting table determining unit is used for accessing the graph structure approximation table row by row, and merging and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain the graph sequence sorting table.

To present the experimental results more clearly and intuitively, the running-time results are shown in a data table and histograms. Table 5 lists the running times of the CSM-KNN algorithm and the Feature-KNN algorithm on the two data sets; graph (a) in FIG. 7 compares the running times of the two algorithms on the MIMIC-III data set, and graph (b) in FIG. 7 compares them on the OH data set. From FIG. 7 it can be concluded that the medical record approximate search algorithm studied by the invention can efficiently support practical application.

TABLE 5 CSM-KNN Algorithm versus Feature-KNN Algorithm runtime (sec) comparison results

To present the experimental results more clearly and intuitively, the accuracy comparison results are shown in a data table and histograms. Table 6 lists the average accuracy of the CSM-KNN algorithm and the Feature-KNN algorithm on the MIMIC-III data set; graph (a) in FIG. 8 compares the accuracy of the two algorithms on the MIMIC-III data set, and graph (b) in FIG. 8 compares it on the OH data set.

TABLE 6 average accuracy comparison of CSM-KNN algorithm and Feature-KNN algorithm

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and similar parts between the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A medical record graph sequence retrieval method based on subtree inverted index is characterized by comprising the following steps:

step S3: giving the size table of the subtree structure t_q to be queried;

2. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 1, wherein said step S1 specifically includes:

3. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 1, wherein said step S4 specifically includes:

4. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 3, wherein said step S5 specifically includes:

step S52: by using

5. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 4, wherein said step S6 specifically includes:

step S62: calculating the graph structure approximate distance evaluation sum K, wherein N represents the total number of graph structure sorting tables, and ω_k represents the graph structure approximate distance at the current access position in the kth graph structure sorting table;

6. A system for retrieving a sequence of medical records based on inverted indexes of subtrees, the system comprising:

7. The system of claim 6, wherein the three-level inverted index construction module comprises:

8. The system of claim 6, wherein the sub-tree approximation table query module comprises:

9. The system of claim 8, wherein the graph structure approximation table query module comprises:

a subtree approximate distance evaluation sum determination unit for utilizing

10. The medical record graph sequence retrieval system based on inverted index of subtree as claimed in claim 9, wherein said graph sequence approximation table query module specifically comprises: