CN113010746A - Medical record sequence retrieval method and system based on subtree inverted index - Google Patents


Info

Publication number
CN113010746A
Authority
CN
China
Prior art keywords
graph
sequence
approximate
sub
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110294328.9A
Other languages
Chinese (zh)
Other versions
CN113010746B (en)
Inventor
王晓黎
黄烨钒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202110294328.9A
Publication of CN113010746A
Application granted
Publication of CN113010746B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a medical record sequence retrieval method and a medical record sequence retrieval system based on subtree inverted indexes, wherein firstly, three layers of inverted indexes of a medical record sequence database are constructed based on a subtree decomposition algorithm; secondly, based on the subtree inverted index table and the size table, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure; then based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate table corresponding to each graph structure is obtained by adopting a graph structure approximate query algorithm; and finally, obtaining a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure. By adopting the method and the system, the three-layer inverted index is combined with the approximate query algorithm, the relation between the multi-mode data is established, and the case approximate search is carried out on the basis, so that the search accuracy is improved.

Description

Medical record sequence retrieval method and system based on subtree inverted index
Technical Field
The invention relates to the field of medical record sequence retrieval, in particular to a medical record sequence retrieval method and system based on subtree inverted index.
Background
With the rapid development of information technology, data in every industry has become richer and more diverse, producing multi-modal data such as text, pictures, audio and video.
Because multi-modal data is diverse, complex and random, structured unified management is difficult to achieve. Moreover, the data are often interrelated, and their true value can be realized only by mining these latent relations. Traditional database technologies usually process data of a single modality, and data of different modalities are represented by different complex data models, such as character strings, trees, graphs, high-dimensional data and dynamic sequences. These methods cannot represent the relevance between multi-modal data and cannot meet the comprehensive requirements of information retrieval. Some cross-media unified indexing techniques solve the cross-domain query problem for data with significant semantic relevance. However, these solutions apply only to highly correlated data such as social media data, and cannot process medical health data whose semantic relations are fuzzy. Because the underlying data foundation is weak, the analysis results are often of little significance and lack practicality. Therefore, how to effectively and uniformly model and index multi-modal data with fuzzy semantic associations is an important scientific problem to be solved.
In addition, the structure and content of multi-modal medical health data are not fixed, but evolve and change over time. For example, an electronic medical record often contains multiple medical records of a patient, and the data structure and content generated by each record are often not fixed; health information such as body temperature collected by a medical mobile platform varies even more with the physical state of the user. Analyzing such dynamically changing attributes of the data is of great significance for both predicting a patient's condition and monitoring a user's health. Existing medical big data analysis methods cannot describe the dynamic attributes of the data, and complex machine learning algorithms are often needed to analyze and predict how the data evolve. Due to the limitations of artificial intelligence in processing complex, variable and dynamic environments, the accuracy of the analysis results is often too low. Therefore, how to design a new dynamic model that accurately describes the evolution and change of medical health data over time is a key scientific problem to be solved.
Disclosure of Invention
The invention aims to provide a medical record sequence retrieval method and system based on subtree inverted index, so as to improve the accuracy of case search.
In order to achieve the above object, the present invention provides a medical record graph sequence retrieval method based on subtree inverted index, which comprises:
step S1: constructing three-layer inverted indexes of a medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;
step S2: obtaining a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, the graph structures are decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes;
step S3: giving a size table of the subtree structure $t_q$ to be queried;
step S4: based on the subtree inverted index table and the size table, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure;
step S5: based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate table corresponding to each graph structure is obtained by adopting a graph structure approximate query algorithm;
step S6: and obtaining a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.
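The decomposition hierarchy of step S2 (graph sequence, graph structures, subtree structures, nodes) can be pictured with a few nested types. The following is a minimal sketch under stated assumptions, not the claimed implementation: the class names `Subtree`, `GraphStructure` and `GraphSequence`, the root-plus-leaves representation of a subtree, and the labels in the sample query are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Subtree:
    """One subtree structure; root + leaf labels is an assumed representation."""
    root: str
    leaves: Tuple[str, ...]

    def node_sequence(self) -> List[str]:
        # A subtree structure decomposes into a node sequence (root, then leaves).
        return [self.root, *self.leaves]


@dataclass(frozen=True)
class GraphStructure:
    """One graph structure, decomposed into a sequence of subtree structures."""
    graph_id: str
    subtrees: Tuple[Subtree, ...]


@dataclass(frozen=True)
class GraphSequence:
    """A graph sequence to be queried: several graph structures in order."""
    sequence_id: str
    graphs: Tuple[GraphStructure, ...]


def decompose(seq: GraphSequence) -> List[Tuple[str, Subtree, List[str]]]:
    """Flatten a sequence into (graph id, subtree, node sequence) triples."""
    return [
        (g.graph_id, t, t.node_sequence())
        for g in seq.graphs
        for t in g.subtrees
    ]


# Hypothetical query with one graph structure containing one subtree.
query = GraphSequence(
    "Gq",
    (GraphStructure("g1", (Subtree("diagnosis", ("fever", "cough")),)),),
)
triples = decompose(query)
```

Each triple carries exactly the pieces that the later steps consume: the node sequence for the subtree-level lookup, the subtree for the graph-structure-level lookup, and the graph identifier for the sequence-level lookup.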
Optionally, the step S1 specifically includes:
step S11: decomposing each medical record graph sequence into a medical record graph structure sequence, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences by taking each medical record graph structure in the medical record graph structure sequence as an index;
step S12: decomposing each medical record graph structure into a medical record sub-tree sequence, and establishing a graph structure inverted index table corresponding to all medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index;
step S13: and decomposing each medical record subtree structure into a medical record node sequence, and establishing a subtree inverted index table corresponding to all medical record subtree structures by taking each medical record node in the medical record node sequence as an index.
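Steps S11 to S13 apply the same inversion at three levels. The sketch below is one possible way to express this, assuming each container (graph sequence, graph structure, subtree structure) is given as a list of the identifiers of its parts; the function name `invert` and the toy identifiers and labels are hypothetical, and storing a frequency per posting is an assumption at the lower two levels.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def invert(containers: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Map each element to {container id: frequency of the element in it}."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for container_id, elements in containers.items():
        for element, freq in Counter(elements).items():
            index[element][container_id] = freq
    return dict(index)


# Hypothetical toy database; identifiers and node labels are made up.
graph_sequences = {"G1": ["g1", "g2"], "G2": ["g1", "g3"]}
graph_structures = {"g1": ["t1", "t2"], "g2": ["t2"], "g3": ["t3"]}
subtree_structures = {"t1": ["a", "b"], "t2": ["a", "c", "c"], "t3": ["b"]}

graph_sequence_index = invert(graph_sequences)    # step S11: keyed by graph structure
graph_structure_index = invert(graph_structures)  # step S12: keyed by subtree structure
subtree_index = invert(subtree_structures)        # step S13: keyed by node
```

Using one helper for all three layers keeps the tables structurally identical, so a lookup at one level yields keys that can be fed directly into the table one level up.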
Optionally, the step S4 specifically includes:
step S41: accessing the subtree inverted index table to obtain a subtree sorting table corresponding to each node;
step S42: classifying the subtree sorting tables according to the size table, wherein for subtree sorting tables smaller than the size table, α is computed using $\alpha = 2 \times |L_q| - (t(\beta) + \tau)$; for subtree sorting tables greater than or equal to the size table, α is computed using $\alpha = -|L_q| - (t(\beta) - 2 \times \tau)$; wherein α represents the approximate distance between each subtree structure and the subtree structure $t_q$ to be queried, $L_q$ is defined on the subtree structure $t_q$ to be queried, $t(\beta)$ represents the number of common leaf labels, and τ represents the last seen size value in the size table;
step S43: accessing the current subtree structure in the subtree sorting table, and judging whether α is larger than the maximum subtree approximate distance in the subtree approximation table; if α is larger than the maximum subtree approximate distance in the subtree approximation table, stopping subsequent access and outputting the subtree approximation table corresponding to each subtree structure; if α is smaller than or equal to the maximum subtree approximate distance in the subtree approximation table, adding the currently accessed subtree structure in the subtree sorting table into the subtree approximation table, and accessing the next subtree structure in the subtree sorting table until α is larger than the maximum subtree approximate distance in the subtree approximation table; each subtree approximation table comprises k1 subtree structures, and the subtree structures in the subtree approximation table are subsequently called approximate subtrees.
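The distance computation of step S42 and the stopping test of step S43 can be sketched as follows. This is an illustration under assumptions, not the claimed algorithm: $|L_q|$ is taken to be the number of leaves of the query subtree, the second case is read as $2 \times \tau - |L_q| - t(\beta)$ so that both cases agree at $\tau = |L_q|$, candidates are pre-sorted by α, and the names `approx_distance` and `top_k_subtrees` as well as the candidate data are hypothetical.

```python
from typing import List, Tuple


def approx_distance(lq_size: int, common: int, tau: int) -> int:
    """Approximate distance alpha between a candidate subtree and the query.

    lq_size: |L_q| (assumed: number of leaves of the query subtree)
    common:  t(beta), the number of common leaf labels
    tau:     size value of the candidate taken from the size table
    """
    if tau < lq_size:
        return 2 * lq_size - (common + tau)
    return 2 * tau - lq_size - common


def top_k_subtrees(
    candidates: List[Tuple[str, int, int]], lq_size: int, k1: int
) -> List[Tuple[int, str]]:
    """candidates holds (subtree id, t(beta), tau) triples in any order."""
    scored = sorted(
        (approx_distance(lq_size, common, tau), sid)
        for sid, common, tau in candidates
    )
    table: List[Tuple[int, str]] = []
    for alpha, sid in scored:
        # Stop once the table is full and alpha exceeds its largest distance.
        if len(table) == k1 and alpha > table[-1][0]:
            break
        if len(table) < k1:
            table.append((alpha, sid))
    return table


# Hypothetical candidates for a query subtree with |L_q| = 3.
candidates = [("t1", 3, 3), ("t2", 2, 2), ("t3", 1, 4), ("t4", 2, 3)]
approximation_table = top_k_subtrees(candidates, lq_size=3, k1=2)
```

With these values the scan stops at the third candidate, so the most distant subtree is never added to the table.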
Optionally, the step S5 specifically includes:
step S51: accessing the subtree approximation table row by row, and combining and sorting the graph structure inverted index tables corresponding to k1 approximation subtrees of the subtree structure to obtain a graph structure sorting table;
step S52: by using
Figure BDA0002983738320000031
Calculating a subtree approximate distance evaluation sum gamma, wherein M represents the total number of subtree sorting tables, thetajRepresenting the subtree approximate distance of the current access position in the jth subtree sorting table;
step S53: accessing the current graph structure in the graph structure sorting table, and judging whether the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; if it is greater than the maximum graph structure approximate distance in the graph structure approximation table, stopping subsequent access and outputting the graph structure approximation table corresponding to each graph structure; if it is smaller than or equal to the maximum graph structure approximate distance in the graph structure approximation table, adding the currently accessed graph structure into the graph structure approximation table, and accessing the next graph structure in the graph structure sorting table until the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are subsequently referred to as approximate graph structures.
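The evaluation sum and the stopping test of steps S52 and S53 resemble a threshold-style top-k scan over several sorted lists. The sketch below illustrates that idea only; it assumes each of the M lists holds (distance, graph id) pairs in ascending order of distance, and that the graph structure approximate distance is supplied by the caller, since its definition is not given in this passage. The names `evaluation_sum` and `top_k_graphs` and all sample values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

SortedList = List[Tuple[int, str]]  # (approximate distance, graph id), ascending


def evaluation_sum(lists: List[SortedList], depth: int) -> int:
    """gamma: sum over the M lists of theta_j at the current access position."""
    return sum(lst[min(depth, len(lst) - 1)][0] for lst in lists)


def top_k_graphs(
    lists: List[SortedList], graph_distance: Callable[[str], int], k2: int
) -> List[Tuple[int, str]]:
    table: Dict[str, int] = {}
    for depth in range(max(len(lst) for lst in lists)):
        # Access every list at the current row and score newly seen graphs.
        for lst in lists:
            if depth < len(lst):
                graph_id = lst[depth][1]
                if graph_id not in table:
                    table[graph_id] = graph_distance(graph_id)
        best = sorted((d, g) for g, d in table.items())[:k2]
        # Stop when gamma exceeds the largest distance kept in the table.
        if len(best) == k2 and evaluation_sum(lists, depth) > best[-1][0]:
            return best
    return sorted((d, g) for g, d in table.items())[:k2]


# Hypothetical data: M = 2 lists, distances chosen for illustration.
lists = [
    [(0, "g1"), (2, "g2"), (5, "g3")],
    [(1, "g2"), (1, "g1"), (4, "g3")],
]
distances = {"g1": 1, "g2": 3, "g3": 9}
result = top_k_graphs(lists, distances.__getitem__, k2=1)
```

In this run the scan ends after the second row, because γ reaches 3 while the best graph structure found so far has distance 1, so the third row is never read.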
Optionally, the step S6 specifically includes:
step S61: accessing the graph structure approximate table line by line, and combining and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain a graph sequence sorting table;
step S62: by using
Figure BDA0002983738320000041
Calculating a graph structure approximation distance evaluation sum K, where N represents the total number of graph structure sorting tables, ωkA graph structure approximate distance representing a current visited location in the kth graph structure sorted list;
step S63: accessing the current graph sequence in the graph sequence sorting table, and judging whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if it is greater than the maximum graph sequence alignment distance in the graph sequence approximation table, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if it is smaller than or equal to the maximum graph sequence alignment distance in the graph sequence approximation table, adding the currently accessed graph sequence into the graph sequence approximation table, and accessing the next graph sequence in the graph sequence sorting table until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table comprises k3 graph sequences, and the graph sequences in the graph sequence approximation table are subsequently called approximate graph sequences.
The invention also provides a medical record graph sequence retrieval system based on the subtree inverted index, which comprises:
the three-layer inverted index construction module is used for constructing three-layer inverted indexes of the medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;
an obtaining module, configured to obtain a graph sequence to be queried, where the graph sequence includes a plurality of graph structures, the graph structure is decomposed into a sub-tree sequence, the sub-tree sequence includes a plurality of sub-tree structures, and then each sub-tree structure is decomposed into a node sequence, where the node sequence includes a plurality of nodes;
a giving module, configured to give a size table of the subtree structure $t_q$ to be queried;
the sub-tree approximate table query module is used for obtaining a sub-tree approximate table corresponding to each sub-tree structure by adopting a sub-tree approximate query algorithm based on the sub-tree inverted index table and the size table;
the graph structure approximation table query module is used for obtaining a graph structure approximation table corresponding to each graph structure by adopting a graph structure approximation query algorithm based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure;
and the graph sequence approximation table query module is used for acquiring a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.
Optionally, the three-layer inverted index building module specifically includes:
the graph sequence inverted index table construction unit is used for decomposing each medical record graph sequence into a medical record graph structure sequence, taking each medical record graph structure in the medical record graph structure sequence as an index, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences;
the graph structure inverted index table construction unit is used for decomposing each medical record graph structure into a medical record sub-tree sequence, and establishing a graph structure inverted index table corresponding to all the medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index;
and the subtree inverted index table construction unit is used for decomposing each medical record subtree structure into a medical record node sequence, and establishing subtree inverted index tables corresponding to all medical record subtree structures by taking each medical record node in the medical record node sequence as an index.
Optionally, the sub-tree approximation table querying module specifically includes:
the subtree sorting table determining unit is used for accessing the subtree inverted index table to obtain a subtree sorting table corresponding to each node;
an approximate distance calculation unit, configured to classify the subtree sorting tables according to the size table, wherein for subtree sorting tables smaller than the size table, α is computed using $\alpha = 2 \times |L_q| - (t(\beta) + \tau)$; for subtree sorting tables greater than or equal to the size table, α is computed using $\alpha = -|L_q| - (t(\beta) - 2 \times \tau)$; wherein α represents the approximate distance between each subtree structure and the subtree structure $t_q$ to be queried, $L_q$ is defined on the subtree structure $t_q$ to be queried, $t(\beta)$ represents the number of common leaf labels, and τ represents the last seen size value in the size table;
the first judging unit is used for accessing the current subtree structure in the subtree sorting table and judging whether α is larger than the maximum subtree approximate distance in the subtree approximation table; if α is larger than the maximum subtree approximate distance in the subtree approximation table, stopping subsequent access and outputting the subtree approximation table corresponding to each subtree structure; if α is smaller than or equal to the maximum subtree approximate distance in the subtree approximation table, adding the currently accessed subtree structure in the subtree sorting table into the subtree approximation table, and accessing the next subtree structure in the subtree sorting table until α is larger than the maximum subtree approximate distance in the subtree approximation table; each subtree approximation table comprises k1 subtree structures, and the subtree structures in the subtree approximation table are subsequently called approximate subtrees.
Optionally, the graph structure approximation table query module specifically includes:
the graph structure ordering table determining unit is used for accessing the subtree approximation table row by row and combining and ordering graph structure inverted index tables corresponding to k1 approximation subtrees of the subtree structure to obtain a graph structure ordering table;
a subtree approximate distance evaluation sum determination unit, configured to calculate the subtree approximate distance evaluation sum γ using $\gamma = \sum_{j=1}^{M} \theta_j$, wherein M represents the total number of subtree sorting tables, and $\theta_j$ represents the subtree approximate distance at the current access position in the j-th subtree sorting table;
the second judging unit is used for accessing the current graph structure in the graph structure sorting table and judging whether the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; if it is greater than the maximum graph structure approximate distance in the graph structure approximation table, stopping subsequent access and outputting the graph structure approximation table corresponding to each graph structure; if it is smaller than or equal to the maximum graph structure approximate distance in the graph structure approximation table, adding the currently accessed graph structure into the graph structure approximation table, and accessing the next graph structure in the graph structure sorting table until the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are subsequently referred to as approximate graph structures.
Optionally, the graph sequence approximation table query module specifically includes:
the graph sequence sorting table determining unit is used for accessing the graph structure approximation table line by line and combining and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain a graph sequence sorting table;
a graph structure approximate distance evaluation sum determination unit, configured to calculate the graph structure approximate distance evaluation sum K using $K = \sum_{k=1}^{N} \omega_k$, wherein N represents the total number of graph structure sorting tables, and $\omega_k$ represents the graph structure approximate distance at the current access position in the k-th graph structure sorting table;
a third judging unit, configured to access the current graph sequence in the graph sequence sorting table, and judge whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if it is greater than the maximum graph sequence alignment distance in the graph sequence approximation table, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if it is smaller than or equal to the maximum graph sequence alignment distance in the graph sequence approximation table, adding the currently accessed graph sequence into the graph sequence approximation table, and accessing the next graph sequence in the graph sequence sorting table until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table comprises k3 graph sequences, and the graph sequences in the graph sequence approximation table are subsequently called approximate graph sequences.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention relates to a medical record sequence retrieval method and a medical record sequence retrieval system based on subtree inverted indexes, wherein firstly, three layers of inverted indexes of a medical record sequence database are constructed based on a subtree decomposition algorithm; secondly, based on the subtree inverted index table and the size table, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure; then based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate table corresponding to each graph structure is obtained by adopting a graph structure approximate query algorithm; and finally, obtaining a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure. By adopting the method and the system, the three-layer inverted index is combined with the approximate query algorithm, the relation between the multi-mode data is established, and the case approximate search is carried out on the basis, so that the accuracy of case search is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is an example diagram of patient medical record graph sequences according to an embodiment of the invention;
FIG. 2 is a tree diagram illustrating an embodiment of the present invention;
FIG. 3 is a diagram illustrating an exemplary calculation of subtree approximate distances according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram illustrating the calculation of graph structure approximate distances according to an embodiment of the present invention;
FIG. 5 is a flowchart of a full example of an embodiment of the present invention;
FIG. 6 is a flowchart of a medical record graph sequence retrieval method based on inverted indexes of subtrees according to an embodiment of the present invention;
FIG. 7 is a graph of the results of the run time of the approximate medical record search algorithm according to the embodiment of the present invention;
FIG. 8 is a chart showing the result of the accuracy of the medical record approximate search algorithm according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a medical record sequence retrieval method and system based on subtree inverted index, so as to improve the accuracy of case search.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 5 and FIG. 6, the present invention provides a medical record graph sequence retrieval method based on subtree inverted index, including:
step S1: constructing three-layer inverted indexes of a medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;
step S2: obtaining a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, the graph structures are decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes;
step S3: giving a size table of the subtree structure $t_q$ to be queried;
step S4: based on the subtree inverted index table and the size table, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure;
step S5: based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate table corresponding to each graph structure is obtained by adopting a graph structure approximate query algorithm;
step S6: and obtaining a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.
The individual steps are discussed in detail below:
step S1: constructing three-layer inverted indexes of a medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table. The medical record graph sequence database comprises a plurality of medical record graph sequences;
step S11: decomposing each medical record graph sequence into a medical record graph structure sequence, taking each medical record graph structure in the medical record graph structure sequence as an index, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences, wherein the specific implementation process is as follows:
Given the medical record graph sequence of a patient i,
G_i = {g_1, g_2, ..., g_|G_i|},
where g_j is the medical record graph structure for the j-th admission of patient i, and |G_i| is the number of graph structures in the medical record graph sequence G_i (i.e., the number of admissions of patient i). All graph structures in the sequence are ordered by the patient's actual admission time (e.g., the admission time of g_1 is earlier than that of g_2). As this representation shows, a medical record graph sequence consists of the medical record graph structures corresponding to a patient's series of admission records, each representing the medical record of one admission. The sequence can therefore be decomposed directly into a sequence of graph structures, and each medical record graph structure is used as an index key under which an inverted list stores information about the graph sequences that contain it. Each entry in the inverted list stores the identifier of a graph sequence and the frequency with which the corresponding graph structure appears in that graph sequence.
As shown in FIG. 1, consider a medical record graph sequence database containing two patients, denoted {G_1, G_2}. The medical record graph sequence G_1 contains two admission records, represented by the medical record graph structures g_1 and g_2; the medical record graph sequence G_2 also contains two admission records, represented by the medical record graph structures g_1 and g_3. Table 1 shows the graph sequence inverted index constructed from this database, with the medical record graph structures g_1, g_2 and g_3 as index keys. The inverted list of g_1 contains two entries, {G_1, 1} and {G_2, 1}, indicating that G_1 and G_2 each contain the graph structure g_1 once; the inverted list of g_2 contains one entry, {G_1, 1}, indicating that G_1 contains g_2 once; the inverted list of g_3 contains one entry, {G_2, 1}, indicating that G_2 contains g_3 once.
TABLE 1 Graph sequence inverted index
Figure BDA0002983738320000091
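The construction of the graph sequence inverted index in step S11 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_graph_sequence_index` and the dictionary layout are assumptions, and graph structures are represented only by their identifiers.

```python
from collections import Counter, defaultdict

def build_graph_sequence_index(database):
    """Map each graph structure id to an inverted list of (sequence id, frequency)."""
    index = defaultdict(list)
    for seq_id, graph_ids in database.items():
        # Count how often each graph structure occurs in this graph sequence.
        for graph_id, freq in Counter(graph_ids).items():
            index[graph_id].append((seq_id, freq))
    return dict(index)

# The two-patient database described for FIG. 1; each sequence lists its
# graph structures in admission order.
database = {"G1": ["g1", "g2"], "G2": ["g1", "g3"]}
index = build_graph_sequence_index(database)
print(index)
```

The resulting lists match the entries described for Table 1: g1 maps to G1 and G2, g2 to G1, and g3 to G2, each with frequency 1.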
Step S12: decomposing each medical record graph structure into a medical record subtree sequence, and establishing, with each medical record subtree structure in the subtree sequence as an index, a graph structure inverted index table for all medical record graph structures. The specific implementation is as follows:
For a medical record graph structure g_j in the medical record graph sequence of a given patient i, g_j is decomposed and represented as a medical record subtree sequence
T(g_j) = {t_1, t_2, ..., t_|g_j|},
where t_m denotes a medical record subtree structure and |g_j| denotes the number of medical record nodes in the medical record graph structure g_j (i.e., the number of subtrees obtained by decomposing g_j).
To decompose a medical record graph structure, the invention traverses each medical record node in its node set and finds the nodes whose out-degree is not 0, where the out-degree of a node is the number of edges pointing from that node to other medical record nodes. Each such node, together with the nodes its outgoing edges point to, forms a medical record subtree structure. An inverted index is built for all graph structures using the subtrees as index keys, with an inverted list under each. Each entry in the inverted list contains the identifier of a graph structure and the frequency with which the corresponding subtree appears in that graph structure, and all lists are sorted in increasing order of graph structure size.
The medical record graph structures g_1, g_2 and g_3 in FIG. 1 are traversed in turn. First, each medical record node in the node list {n_1, n_2, n_3, n_4} of g_1 is traversed. Starting with n_1: n_1 has two edges {e_1, e_2} pointing to other medical record nodes, so the nodes connected by e_1 and e_2 are extracted to form the medical record subtree structure t_1 in FIG. 2. Next, n_2 is checked; since n_2 has no edges pointing to other medical record nodes, it is skipped. Likewise, n_3 has no edges pointing to other medical record nodes and is skipped. Finally, n_4 is checked; n_4 has an edge e_3 pointing to the medical record node n_2, so the nodes connected by e_3 are extracted to form the medical record subtree structure t_2 in FIG. 2.
Then each medical record node in the node list {n_1, n_2, n_3, n_5} of g_2 is traversed. Starting with n_1: n_1 has two edges {e_1, e_2} pointing to other medical record nodes, and the nodes connected by e_1 and e_2 form a subtree identical to t_1 in FIG. 2, so it is not added again. Next, n_2 and n_3 are checked and skipped because neither has edges pointing to other medical record nodes. Finally, n_5 is checked; n_5 has two edges pointing to the medical record node n_2, and the nodes connected by these edges (e_3) are extracted to form the medical record subtree structure t_3 in FIG. 2.
Finally, each medical record node in the node list {n_1, n_2, n_3, n_5} of g_3 is traversed. Starting with n_1: n_1 has two edges {e_1, e_2} pointing to other medical record nodes, and the nodes connected by e_1 and e_2 form a subtree identical to t_1 in FIG. 2, so it is not added again. Next, n_2 and n_3 are checked and skipped because neither has edges pointing to other medical record nodes. Finally, n_5 is checked; n_5 has one edge e_3 pointing to the medical record node n_2, and the nodes connected by e_3 are extracted to form the medical record subtree structure t_4 in FIG. 2.
Decomposing all medical record graph structures g_1, g_2 and g_3 of the medical record graph sequences in FIG. 1 thus yields the four medical record subtree structures t_1, t_2, t_3 and t_4 in FIG. 2. Table 2 shows the graph structure inverted index constructed from these medical record subtree structures, where freq is the frequency of the corresponding subtree.
TABLE 2 Graph structure inverted index
Figure BDA0002983738320000101
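The decomposition walk-through and the graph structure inverted index of step S12 can be sketched as follows. The adjacency lists are an assumption reconstructed from the text (n_1 points to n_2 and n_3, n_4 points to n_2, n_5 points to n_2 twice in g_2 and once in g_3), since FIG. 1 itself is not reproduced here; the function names and the encoding of a subtree as a (root, sorted leaves) pair are likewise illustrative.

```python
from collections import Counter, defaultdict

def decompose(graph):
    """Return one subtree (root, sorted leaf tuple) per node with out-degree > 0."""
    return [(node, tuple(sorted(children)))
            for node, children in graph.items() if children]

def build_graph_structure_index(graphs):
    """Map each distinct subtree to an inverted list of (graph id, freq)."""
    index = defaultdict(list)
    # Visit graphs in increasing size (node count) so every list is size-ordered.
    for graph_id, graph in sorted(graphs.items(), key=lambda kv: len(kv[1])):
        for subtree, freq in Counter(decompose(graph)).items():
            index[subtree].append((graph_id, freq))
    return dict(index)

# Assumed adjacency lists: node -> list of nodes its outgoing edges point to.
graphs = {
    "g1": {"n1": ["n2", "n3"], "n2": [], "n3": [], "n4": ["n2"]},
    "g2": {"n1": ["n2", "n3"], "n2": [], "n3": [], "n5": ["n2", "n2"]},
    "g3": {"n1": ["n2", "n3"], "n2": [], "n3": [], "n5": ["n2"]},
}
t1, t2 = ("n1", ("n2", "n3")), ("n4", ("n2",))
t3, t4 = ("n5", ("n2", "n2")), ("n5", ("n2",))
index = build_graph_structure_index(graphs)
for subtree, postings in index.items():
    print(subtree, postings)
```

Under these assumptions the decomposition produces exactly four distinct subtrees, with t_1 shared by all three graph structures and t_2, t_3 and t_4 each occurring in a single one.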
Step S13: decomposing each medical record subtree structure into a medical record node sequence, and establishing, with each medical record node in the node sequence as an index, a subtree inverted index table for all medical record subtree structures.
The subtree inverted index consists of two main parts: an index over all distinct medical record nodes, and an inverted list under each medical record node. Each entry in the inverted list contains the identifier of a subtree and the frequency with which the corresponding medical record node appears in that subtree. All lists are sorted in increasing order of subtree size.
For the medical record subtree structures t_1, t_2, t_3 and t_4 in FIG. 2, Table 3 shows the subtree inverted index constructed from the medical record nodes, i.e., with the medical record nodes n_1, ..., n_10 as index keys. Each medical record subtree structure is further decomposed into units (i.e., vertices and edges), which are indexed in inverted lists. This index also has two components: a label index, arranged in ascending alphabetical order of the medical record node label names (for example, if n_1 is diabetes and n_2 is influenza, then n_2 is placed before n_1), and the inverted list under each label, which records the subtree identifier and the frequency of the corresponding label. The entries in each list are first grouped by leaf size and then sorted within each group by decreasing frequency. A leaf node is a medical record node that has no edges pointing to other medical record nodes. For example, the first column of Table 3 lists the identifiers {t_1, t_2, t_3, t_4} of the medical record subtree structures, and the second column gives the number of leaf nodes of each subtree structure: in FIG. 2, t_1 has 2 leaf nodes (n_2, n_3), t_2 has 1 leaf node (n_2), t_3 has 2 leaf nodes (n_2, n_2), and t_4 has 1 leaf node (n_2). In Table 3, column 1 is the subtree identifier and column 2 is the number of occurrences of the medical record node n_2 in the subtree; since n_2 occurs most frequently in subtree t_3, t_3 is located in the first row of the list of n_2.
TABLE 3 Subtree inverted index
Figure BDA0002983738320000111
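A possible shape for the subtree inverted index of step S13 is sketched below. It is deliberately simplified and rests on stated assumptions: only leaf labels are indexed, each entry stores (subtree id, leaf count, frequency), and lists are ordered by decreasing frequency only; the grouping by leaf size mentioned in the text is not reproduced. All names are hypothetical.

```python
from collections import Counter, defaultdict

def build_subtree_index(subtrees):
    """Map each leaf label to entries (subtree id, leaf count, label frequency)."""
    index = defaultdict(list)
    for subtree_id, (root, leaves) in subtrees.items():
        for label, freq in Counter(leaves).items():
            index[label].append((subtree_id, len(leaves), freq))
    for postings in index.values():
        # Stable sort: highest frequency first, ties keep insertion order.
        postings.sort(key=lambda entry: -entry[2])
    return dict(index)

# The four subtrees described for FIG. 2, as (root, list of leaf labels).
subtrees = {
    "t1": ("n1", ["n2", "n3"]),
    "t2": ("n4", ["n2"]),
    "t3": ("n5", ["n2", "n2"]),
    "t4": ("n5", ["n2"]),
}
index = build_subtree_index(subtrees)
print(index["n2"])
```

With this ordering the entry for t_3 comes first in the list of n_2, because n_2 occurs twice among the leaves of t_3 and once in every other subtree.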
Step S2: the graph sequence to be inquired is obtained, the graph sequence comprises a plurality of graph structures, the graph structures are decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes.
A subtree structure is represented by a triplet (r, L, l), where r is the root node of the subtree structure, L is the set of leaf nodes, and l is a labeling function. For example, in FIG. 3 the root node of the subtree structure t_3 is n_5, whose label is diabetes, and the root node of the subtree structure t_4 is also n_5, whose label is also diabetes.
The subtree approximate distance between any two subtree structures is calculated as follows:
λ(t_1, t_2) = T(r_1, r_2) + d(L_1, L_2)
d(L_1, L_2) = ||L_1| - |L_2|| + M(L_1, L_2)
Figure BDA0002983738320000112
where λ(t_1, t_2) denotes the subtree approximate distance between the subtree structures t_1 and t_2; T(r_1, r_2) depends on whether l(r_1) and l(r_2) are equal: if l(r_1) = l(r_2), i.e., the labels of the two root nodes coincide, then T(r_1, r_2) = 0, otherwise T(r_1, r_2) = 1; l denotes the labeling function; r_1 is the root node of t_1 and r_2 is the root node of t_2; L_1 is the set of leaf nodes of t_1 and L_2 is the set of leaf nodes of t_2; d(L_1, L_2) denotes the distance between L_1 and L_2;
Figure BDA0002983738320000121
denotes the set of labels of the leaf nodes of the subtree t_1, and
Figure BDA0002983738320000122
denotes the set of labels of the leaf nodes of the subtree t_2.
As shown in FIG. 3, the subtree approximate distance between the subtree structures t_3 and t_4 is computed as follows. Since l(r_3) = l(r_4) = n_5, T(r_3, r_4) = 0. The leaf node list of t_3 is L_3 = {n_2, n_2}, so |L_3| = 2, and |L_4| = 1,
Figure BDA0002983738320000123
Therefore, λ(t_3, t_4) = 0 + |2 - 1| + (2 - 1) = 2.
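The worked example can be checked with a short sketch. The formula for M(L_1, L_2) is an image in the original and is not reproduced in the text, so the code assumes M is the larger leaf count minus the size of the multiset intersection of the leaf labels; this assumption reproduces the "2 - 1" term of the example but is not confirmed by the source. Function names are illustrative.

```python
from collections import Counter

def leaf_mismatch(leaves_a, leaves_b):
    """Assumed M(L1, L2): larger leaf count minus the multiset overlap of labels."""
    common = sum((Counter(leaves_a) & Counter(leaves_b)).values())
    return max(len(leaves_a), len(leaves_b)) - common

def subtree_distance(tree_a, tree_b):
    """lambda(t1, t2) = T(r1, r2) + ||L1| - |L2|| + M(L1, L2)."""
    (root_a, leaves_a), (root_b, leaves_b) = tree_a, tree_b
    root_term = 0 if root_a == root_b else 1      # T(r1, r2)
    size_term = abs(len(leaves_a) - len(leaves_b))
    return root_term + size_term + leaf_mismatch(leaves_a, leaves_b)

t3 = ("n5", ["n2", "n2"])
t4 = ("n5", ["n2"])
print(subtree_distance(t3, t4))  # 0 + |2 - 1| + (2 - 1) = 2
```

A subtree compared with itself has distance 0, and subtrees with different roots pay one extra unit through the root term.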
Step S3: providing a size table for the subtree structure t_q to be queried.
TABLE 4 size table
Figure BDA0002983738320000124
Step S4: based on the inverted index table and the size table of the subtree, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure, and the method specifically comprises the following steps:
step S41: accessing the subtree inverted index table to obtain a subtree sequencing table corresponding to each node;
step S42: classifying the subtree sorting tables according to the size table: for each subtree sorting table smaller than the size table, α is computed as α = 2 × |L_q| - (t(β) + τ); for each subtree sorting table greater than or equal to the size table, α is computed as α = |L_q| - (t(β) - 2 × τ); where α denotes the approximate distance between each subtree structure and the subtree structure t_q to be queried, L_q denotes the set of leaf nodes of the subtree structure t_q to be queried, t(β) denotes the number of common leaf labels, and τ denotes the last size value seen in the size table.
Suppose t_q has m frequencies {f_1, f_2, ..., f_m}. The number of common leaf labels is calculated as:
Figure BDA0002983738320000125
where β_j indicates whether the subtree structure t_i appears in the j-th subtree sorting table of the subtree structure t_q to be queried: β_j = 0 if t_i does not appear in that table, and β_j = 1 if it does; f_j is the j-th frequency of the subtree structure t_q to be queried.
Step S43: accessing the current subtree structure in the subtree sorting table and determining whether α is greater than the maximum subtree approximate distance in the subtree approximation table; if so, stopping subsequent access and outputting the subtree approximation table corresponding to each subtree structure; otherwise, adding the currently accessed subtree structure to the subtree approximation table and accessing the next subtree structure in the subtree sorting table, until α is greater than the maximum subtree approximate distance in the subtree approximation table. Each subtree approximation table comprises k1 subtree structures, which are hereinafter referred to as approximate subtrees.
The invention can obtain the subtree structure which is highly similar to the subtree to be inquired, namely the subtree approximation table, by utilizing the subtree approximation inquiry algorithm.
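The first expression of step S42, read as α = 2 × |L_q| - (t(β) + τ), can be related to the leaf distance d with a small numeric check. The sketch assumes that M is the larger leaf count minus the multiset overlap of leaf labels, that t(β) equals the number of common leaf labels, and that τ is the leaf count of the candidate subtree; the query and candidate below are hypothetical. Only the case of a candidate smaller than the query is shown.

```python
from collections import Counter

def alpha_smaller(query_leaf_count, common_labels, tau):
    """alpha = 2 * |L_q| - (t(beta) + tau), for candidates smaller than the query."""
    return 2 * query_leaf_count - (common_labels + tau)

def leaf_distance(leaves_q, leaves_i):
    """Return (d(L_q, L_i), number of common labels) under the assumed M."""
    common = sum((Counter(leaves_q) & Counter(leaves_i)).values())
    size_gap = abs(len(leaves_q) - len(leaves_i))
    return size_gap + max(len(leaves_q), len(leaves_i)) - common, common

leaves_q = ["n2", "n2", "n3"]   # hypothetical query leaves, |L_q| = 3
leaves_i = ["n2", "n3"]         # hypothetical candidate leaves, tau = 2 < |L_q|
distance, common = leaf_distance(leaves_q, leaves_i)
alpha = alpha_smaller(len(leaves_q), common, len(leaves_i))
print(distance, alpha)
```

Under these assumptions both values agree, since (|L_q| - τ) + |L_q| - t(β) simplifies to 2 × |L_q| - (t(β) + τ) whenever τ is smaller than |L_q|.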
The subtree approximate distance is calculated as follows:
λ(t_q, t_i) = T(r_1, r_2) + d(L_1, L_2)
d(L_1, L_2) = ||L_1| - |L_2|| + M(L_1, L_2)
Figure BDA0002983738320000131
where λ(t_q, t_i) denotes the subtree approximate distance between the subtree structure t_q to be queried and the subtree structure t_i; T(r_1, r_2) depends on whether l(r_1) and l(r_2) are equal: if l(r_1) = l(r_2), i.e., the labels of the two root nodes coincide, then T(r_1, r_2) = 0, otherwise T(r_1, r_2) = 1; l denotes the labeling function; r_1 is the root node of t_q and r_2 is the root node of t_i; L_1 is the set of leaf nodes of t_q and L_2 is the set of leaf nodes of t_i; d(L_1, L_2) denotes the distance between L_1 and L_2;
Figure BDA0002983738320000132
denotes the set of labels of the leaf nodes of the subtree structure t_q to be queried, and
Figure BDA0002983738320000133
denotes the set of labels of the leaf nodes of the subtree structure t_i.
In FIG. 4, the matrix on the left is the subtree distance matrix between the subtree sequences T(g_1) and T(g_2); the gray cells indicate the best match between T(g_1) and T(g_2), i.e., μ(g_1, g_2) = 0 + 0 + 9 = 9. For clarity, g_1 and g_2 are represented on the right as subtree sequences, and the best match is marked with solid arrows.
Step S5: based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate query algorithm is adopted to obtain a graph structure approximate table corresponding to each graph structure, and the method specifically comprises the following steps:
step S51: accessing the subtree approximation table row by row, and combining and sorting the graph structure inverted index tables corresponding to k1 approximation subtrees of the subtree structure to obtain a graph structure sorting table;
step S52: calculating the subtree approximate distance evaluation sum γ using
γ = θ_1 + θ_2 + ... + θ_M,
where M denotes the total number of subtree sorting tables and θ_j denotes the subtree approximate distance at the current access position in the j-th subtree sorting table.
Step S53: accessing the current graph structure in the graph structure sorting table and determining whether the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; if so, stopping subsequent access and outputting the graph structure approximation table corresponding to each graph structure; otherwise, adding the currently accessed graph structure to the graph structure approximation table and accessing the next graph structure in the graph structure sorting table, until the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table. Each graph structure approximation table comprises k2 graph structures, which are hereinafter referred to as approximate graph structures.
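The stopping rule of steps S52 and S53 resembles a threshold-style top-k scan, which can be sketched as follows. This is an illustration under assumptions rather than the patented procedure: `top_k_with_threshold` and `exact_distance` are hypothetical names, `exact_distance` stands in for the graph structure approximate distance, and the tables and distance values are invented for the example.

```python
import heapq

def top_k_with_threshold(sorted_lists, exact_distance, k):
    """Scan M ascending (distance, item) lists in parallel and stop early."""
    best = []      # max-heap of the k best items, stored as (-distance, item)
    seen = set()
    depth = 0
    longest = max(len(lst) for lst in sorted_lists)
    while depth < longest:
        for lst in sorted_lists:
            if depth < len(lst) and lst[depth][1] not in seen:
                item = lst[depth][1]
                seen.add(item)
                heapq.heappush(best, (-exact_distance(item), item))
                if len(best) > k:
                    heapq.heappop(best)   # drop the current worst candidate
        # gamma: sum of the distances at the current access position of every list
        gamma = sum(lst[min(depth, len(lst) - 1)][0] for lst in sorted_lists)
        if len(best) == k and gamma > -best[0][0]:
            break
        depth += 1
    return sorted((-neg, item) for neg, item in best)

# Hypothetical sorting tables and exact distances (illustrative values only).
tables = [
    [(0, "ga"), (1, "gb"), (4, "gc")],
    [(0, "gb"), (2, "ga"), (5, "gc")],
]
exact = {"ga": 2, "gb": 1, "gc": 9}
evaluated = []

def exact_distance(item):
    evaluated.append(item)
    return exact[item]

result = top_k_with_threshold(tables, exact_distance, k=1)
print(result, evaluated)
```

In this example the scan stops at the second row, where γ = 1 + 2 = 3 exceeds the best distance found so far, so the exact distance of gc is never computed.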
The graph structure approximate distance is calculated as follows:
Figure BDA0002983738320000142
where P denotes a bijection between the subtree sequences T(g_q) and T(g_i), the subtree structure t_i belongs to the subtree sequence T(g_i), λ(t_i, P(t_i)) is the approximate distance between the two subtree structures t_i and P(t_i), and P(t_i) is the subtree structure in g_q aligned with the subtree structure t_i.
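The role of the bijection P can be illustrated with a brute-force search over all one-to-one alignments of a square subtree distance matrix. The sketch assumes that the graph structure approximate distance is the minimum, over all bijections, of the summed subtree distances, which matches the "best match" reading of FIG. 4 but is not spelled out in the text; the matrix below is hypothetical and is not the matrix of FIG. 4.

```python
from itertools import permutations

def graph_structure_distance(cost):
    """Minimum total cost over all bijections between rows and columns."""
    n = len(cost)
    return min(sum(cost[row][col] for row, col in enumerate(perm))
               for perm in permutations(range(n)))

# Hypothetical 3x3 subtree distance matrix: rows index the subtrees of one
# graph structure, columns the subtrees of the other.
cost = [
    [0, 3, 9],
    [4, 0, 9],
    [9, 9, 9],
]
print(graph_structure_distance(cost))  # diagonal alignment: 0 + 0 + 9 = 9
```

Enumerating permutations is only practical for very small sequences; for larger inputs an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.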
The lower bound of the graph structure approximate distance μ(g_q, g_i) is calculated using
Figure BDA0002983738320000144
where
Figure BDA0002983738320000145
denotes the distance of the first subtree structure of the graph structure g_i in the j-th subtree sorting table, and κ(g_q, g_i) is the lower bound of the graph structure approximate distance μ(g_q, g_i), given by:
Figure BDA0002983738320000143
where S_j denotes the local minimum subtree approximate distance of the graph structure g_i under the j-th subtree sorting table of the graph structure g_q to be queried,
Figure BDA0002983738320000151
contains all subtree approximate distances of the graph structure g_i under the j-th list, and
Figure BDA0002983738320000152
e_x is an approximate distance of the graph structure g_i.
If κ(g_q, g_i) is greater than the top-k values, g_i can be safely filtered out.
Step S6: based on the inverted index table of the graph sequences and the graph structure approximation table corresponding to each graph structure, a graph sequence approximation query algorithm is adopted to obtain a graph sequence approximation table corresponding to each graph sequence, and the method specifically comprises the following steps:
step S61: accessing the graph structure approximate table line by line, and combining and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain a graph sequence sorting table;
step S62: calculating the graph structure approximate distance evaluation sum K using
K = ω_1 + ω_2 + ... + ω_N,
where N denotes the total number of graph structure sorting tables and ω_k denotes the graph structure approximate distance at the current access position in the k-th graph structure sorting table.
Step S63: accessing the current graph sequence in the graph sequence sorting table and determining whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if so, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; otherwise, adding the currently accessed graph sequence to the graph sequence approximation table and accessing the next graph sequence in the graph sequence sorting table, until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table. Each graph sequence approximation table comprises k3 graph sequences, which are hereinafter referred to as approximate graph sequences.
Dynamic time warping is used to measure the similarity between two graph sequences, expressed as the graph alignment distance. The graph alignment distance is defined as follows: given two graph sequences G_1 and G_2 with their sets of graph structures, and a bijection P: G_1 → G_2, the alignment distance between G_1 and G_2 is denoted ω(G_1, G_2). The graph sequence alignment distance is calculated as:
Figure BDA0002983738320000154
where ω(G_q, G_i) denotes the alignment distance between the graph sequences G_q and G_i, P denotes a bijection G_q → G_i,
Figure BDA0002983738320000155
is one of the graph structures obtained by decomposing the graph sequence G_i,
Figure BDA0002983738320000156
is the approximate distance between the two graph structures
Figure BDA0002983738320000157
and
Figure BDA0002983738320000158
and
Figure BDA0002983738320000159
is the graph structure in G_q with which
Figure BDA00029837383200001510
is aligned.
The calculation of ω(G_1, G_2) is equivalent to solving a sequence alignment problem, which is usually solved with a dynamic time warping algorithm whose goal is to find the minimum-weight matching in a given cost matrix.
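A textbook dynamic time warping routine gives a feel for how an alignment distance between two graph sequences can be computed from pairwise graph structure distances. This is a generic sketch, not the exact procedure of the invention: classic dynamic time warping lets one graph structure align with several when the sequences differ in length, whereas the text describes the alignment as a bijection. The pairwise distances below are hypothetical, and `dtw_distance` and `graph_distance` are illustrative names.

```python
def dtw_distance(seq_q, seq_i, graph_distance):
    """Classic dynamic time warping over two sequences of graph structures."""
    inf = float("inf")
    rows, cols = len(seq_q), len(seq_i)
    # table[r][c]: cheapest alignment of the first r items of seq_q
    # with the first c items of seq_i.
    table = [[inf] * (cols + 1) for _ in range(rows + 1)]
    table[0][0] = 0.0
    for r in range(1, rows + 1):
        for c in range(1, cols + 1):
            step = graph_distance(seq_q[r - 1], seq_i[c - 1])
            table[r][c] = step + min(table[r - 1][c],      # repeat seq_i item
                                     table[r][c - 1],      # repeat seq_q item
                                     table[r - 1][c - 1])  # advance both
    return table[rows][cols]

# Hypothetical pairwise graph structure distances for g1, g2 and g3.
pair_cost = {("g1", "g1"): 0, ("g2", "g2"): 0, ("g3", "g3"): 0,
             ("g1", "g2"): 3, ("g1", "g3"): 5, ("g2", "g3"): 2}

def graph_distance(a, b):
    return pair_cost.get((a, b), pair_cost.get((b, a)))

print(dtw_distance(["g1", "g2"], ["g1", "g3"], graph_distance))
```

For the two sequences of FIG. 1 and these invented costs, the cheapest alignment pairs g1 with g1 and g2 with g3, for a total of 2.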
The lower bound κ(G_q, G_i) of the graph sequence alignment distance ω(G_q, G_i) is calculated using
Figure BDA0002983738320000161
where
Figure BDA0002983738320000162
is the distance of the first graph structure of the graph sequence G_i found in the j-th sorted list. Obviously, if κ(G_q, G_i) is greater than the top-k values, the graph sequence G_i can be safely filtered out.
The invention also provides a medical record graph sequence retrieval system based on a subtree inverted index, comprising:
a three-layer inverted index construction module, configured to construct a three-layer inverted index of the medical record graph sequence database based on a subtree decomposition algorithm; the three-layer inverted index comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table.
An acquisition module, configured to obtain a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, each graph structure is decomposed into a subtree sequence, the subtree sequence comprises a plurality of subtree structures, each subtree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes.
A giving module, configured to provide the size table of the subtree structure t_q to be queried.
A subtree approximation table query module, configured to obtain the subtree approximation table corresponding to each subtree structure by a subtree approximate query algorithm, based on the subtree inverted index table and the size table.
A graph structure approximation table query module, configured to obtain the graph structure approximation table corresponding to each graph structure by a graph structure approximate query algorithm, based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure.
A graph sequence approximation table query module, configured to obtain the graph sequence approximation table corresponding to each graph sequence by a graph sequence approximate query algorithm, based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.
As an optional implementation manner, the three-layer inverted index building module of the present invention specifically includes:
and the graph sequence inverted index table construction unit is used for decomposing each medical record graph sequence into a medical record graph structure sequence, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences by taking each medical record graph structure in the medical record graph structure sequence as an index.
And the graph structure inverted index table construction unit is used for decomposing each medical record graph structure into a medical record sub-tree sequence, and establishing a graph structure inverted index table corresponding to all the medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index.
And the subtree inverted index table construction unit is used for decomposing each medical record subtree structure into a medical record node sequence, and establishing subtree inverted index tables corresponding to all medical record subtree structures by taking each medical record node in the medical record node sequence as an index.
As an optional implementation manner, the sub-tree approximation table query module of the present invention specifically includes:
and the sub-tree sequencing table determining unit is used for accessing the sub-tree inverted index table to obtain the sub-tree sequencing table corresponding to each node.
An approximate distance calculation unit, configured to classify the subtree sorting tables according to the size table: for each subtree sorting table smaller than the size table, α is computed as α = 2 × |L_q| - (t(β) + τ); for each subtree sorting table greater than or equal to the size table, α is computed as α = |L_q| - (t(β) - 2 × τ); where α denotes the approximate distance between each subtree structure and the subtree structure t_q to be queried, L_q denotes the set of leaf nodes of the subtree structure t_q to be queried, t(β) denotes the number of common leaf labels, and τ denotes the last size value seen in the size table.
The first judging unit is used for accessing the current sub-tree structure in the sub-tree sequencing table and judging whether alpha is larger than the sub-tree approximate distance maximum in the sub-tree approximate table; if alpha is larger than the sub-tree with the maximum approximate distance in the sub-tree approximate table, stopping subsequent access and outputting the sub-tree approximate table corresponding to each sub-tree structure; if alpha is smaller than or equal to the sub-tree approximate distance maximum in the sub-tree approximate table, adding the currently accessed sub-tree structure in the sub-tree ordering table into the sub-tree approximate table, and accessing the next sub-tree structure in the sub-tree ordering table until alpha is larger than the sub-tree approximate distance maximum in the sub-tree approximate table; each subtree approximation table comprises k1 subtree structures, and the subtree structures in the subtree approximation table are subsequently called approximation subtrees.
As an optional implementation manner, the graph structure approximation table query module of the present invention specifically includes:
and the graph structure ordering table determining unit is used for accessing the subtree approximation table row by row and combining and ordering the graph structure inverted index tables corresponding to the k1 approximation subtrees of the subtree structure to obtain the graph structure ordering table.
A subtree approximate distance evaluation sum determination unit, configured to calculate the subtree approximate distance evaluation sum γ using
γ = θ_1 + θ_2 + ... + θ_M,
where M denotes the total number of subtree sorting tables and θ_j denotes the subtree approximate distance at the current access position in the j-th subtree sorting table.
The second judging unit is used for accessing the current graph structure in the graph structure sorting table and judging whether the evaluation sum of the approximate distances of the subtrees is greater than the maximum approximate distance of the graph structure in the graph structure approximate table; if the distance is larger than the maximum graph structure approximate distance in the graph structure approximate table, stopping subsequent access and outputting the graph structure approximate table corresponding to each graph structure; if the distance is smaller than or equal to the maximum graph structure approximate distance in the graph structure approximate table, each graph structure is added into the graph structure approximate table, and the next graph structure in the graph structure sorting table is accessed until the evaluation sum of the sub-tree approximate distances is larger than the maximum graph structure approximate distance in the graph structure approximate table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are subsequently referred to as approximate graph structures.
As an optional implementation manner, the graph sequence approximation table query module of the present invention specifically includes:
and the graph sequence ordering table determining unit is used for accessing the graph structure approximate table row by row, and combining and ordering the graph sequence reverse index tables corresponding to the k2 approximate graph structures of the graph structure to obtain the graph sequence ordering table.
A graph structure approximate distance evaluation sum determination unit, configured to calculate the graph structure approximate distance evaluation sum K using
K = ω_1 + ω_2 + ... + ω_N,
where N denotes the total number of graph structure sorting tables and ω_k denotes the graph structure approximate distance at the current access position in the k-th graph structure sorting table.
A third judging unit, configured to access the current graph sequence in the graph sequence sorting table and determine whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if so, to stop subsequent access and output the graph sequence approximation table corresponding to the graph sequence; otherwise, to add the currently accessed graph sequence to the graph sequence approximation table and access the next graph sequence in the graph sequence sorting table, until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table. Each graph sequence approximation table comprises k3 graph sequences, which are hereinafter referred to as approximate graph sequences.
To present the experimental results more clearly and intuitively, the runtime results are shown in a data table and in histograms. Table 5 lists the runtimes of the CSM-KNN algorithm and the Feature-KNN algorithm on the two data sets; Fig. 7(a) compares the runtimes of the two algorithms on the MIMIC-III data set, and Fig. 7(b) compares them on the OH data set. Fig. 7 indicates that the approximate medical record retrieval algorithm of the present invention can efficiently support practical applications.
TABLE 5 Runtime (in seconds) comparison of the CSM-KNN algorithm and the Feature-KNN algorithm
[Table 5 is supplied as images BDA0002983738320000182 and BDA0002983738320000191 in the original publication.]
To present the experimental results more clearly and intuitively, the accuracy comparison is shown in a data table and in histograms. Table 6 lists the average accuracy of the CSM-KNN algorithm and the Feature-KNN algorithm on the MIMIC-III data set; Fig. 8(a) compares the accuracy of the two algorithms on the MIMIC-III data set, and Fig. 8(b) compares it on the OH data set.
TABLE 6 Average accuracy comparison of the CSM-KNN algorithm and the Feature-KNN algorithm
[Table 6 is supplied as image BDA0002983738320000192 in the original publication.]
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and the same or similar parts may be cross-referenced between embodiments. Because the disclosed system corresponds to the disclosed method, its description is relatively brief; for relevant details, refer to the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A medical record graph sequence retrieval method based on subtree inverted index is characterized by comprising the following steps:
step S1: constructing three-layer inverted indexes of a medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;
step S2: obtaining a graph sequence to be queried, wherein the graph sequence comprises a plurality of graph structures, the graph structures are decomposed into a sub-tree sequence, the sub-tree sequence comprises a plurality of sub-tree structures, each sub-tree structure is decomposed into a node sequence, and the node sequence comprises a plurality of nodes;
step S3: giving a size table of the subtree structure $t_q$ to be queried;
step S4: based on the subtree inverted index table and the size table, a subtree approximation query algorithm is adopted to obtain a subtree approximation table corresponding to each subtree structure;
step S5: based on the graph structure inverted index table and the sub-tree approximate table corresponding to each sub-tree structure, a graph structure approximate table corresponding to each graph structure is obtained by adopting a graph structure approximate query algorithm;
step S6: and obtaining a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.
2. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 1, wherein said step S1 specifically includes:
step S11: decomposing each medical record graph sequence into a medical record graph structure sequence, and establishing a graph sequence inverted index table corresponding to all the medical record graph sequences by taking each medical record graph structure in the medical record graph structure sequence as an index;
step S12: decomposing each medical record graph structure into a medical record sub-tree sequence, and establishing a graph structure inverted index table corresponding to all medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index;
step S13: and decomposing each medical record subtree structure into a medical record node sequence, and establishing a subtree inverted index table corresponding to all medical record subtree structures by taking each medical record node in the medical record node sequence as an index.
3. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 1, wherein said step S4 specifically includes:
step S41: accessing the subtree inverted index table to obtain a subtree sequencing table corresponding to each node;
step S42: classifying the subtree sorting tables according to the size table, wherein for each subtree sorting table smaller than the size table, $\alpha$ is calculated using $\alpha = 2 \times |L_q| - (t(\beta) + \tau)$; for each subtree sorting table greater than or equal to the size table, $\alpha$ is calculated using $\alpha = |L_q| - (t(\beta) - 2 \times \ldots)$; wherein $\alpha$ represents the approximate distance between each subtree structure and the subtree structure $t_q$ to be queried, $L_q$ represents the subtree structure $t_q$ to be queried, $t(\beta)$ represents the number of common leaf labels, and $\tau$ represents the last seen size value in the size table;
step S43: accessing the current subtree structure in the subtree sorting table, and judging whether $\alpha$ is greater than the maximum subtree approximate distance in the subtree approximation table; if $\alpha$ is greater than the maximum subtree approximate distance in the subtree approximation table, stopping subsequent access and outputting the subtree approximation table corresponding to each subtree structure; if $\alpha$ is less than or equal to the maximum subtree approximate distance in the subtree approximation table, adding the currently accessed subtree structure in the subtree sorting table to the subtree approximation table, and accessing the next subtree structure in the subtree sorting table until $\alpha$ is greater than the maximum subtree approximate distance in the subtree approximation table; each subtree approximation table includes k1 subtree structures, and the subtree structures in the subtree approximation table are subsequently referred to as approximate subtrees.
4. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 3, wherein said step S5 specifically includes:
step S51: accessing the subtree approximation table row by row, and combining and sorting the graph structure inverted index tables corresponding to k1 approximation subtrees of the subtree structure to obtain a graph structure sorting table;
step S52: calculating the subtree approximate distance evaluation sum $\gamma$ using $\gamma = \sum_{j=1}^{M} \theta_j$, wherein $M$ represents the total number of subtree sorting tables and $\theta_j$ represents the subtree approximate distance at the current access position in the $j$-th subtree sorting table;
step S53: accessing the current graph structure in the graph structure sorting table, and judging whether the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; if it is greater, stopping subsequent access and outputting the graph structure approximation table corresponding to each graph structure; if it is less than or equal to that maximum, adding the currently accessed graph structure to the graph structure approximation table, and accessing the next graph structure in the graph structure sorting table until the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are subsequently referred to as approximate graph structures.
5. The method for retrieving medical record graph sequences based on inverted subtree indexing as claimed in claim 4, wherein said step S6 specifically includes:
step S61: accessing the graph structure approximate table line by line, and combining and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of the graph structure to obtain a graph sequence sorting table;
step S62: calculating the graph structure approximate distance evaluation sum $K$ using $K = \sum_{k=1}^{N} \omega_k$, wherein $N$ represents the total number of graph structure sorting tables and $\omega_k$ represents the graph structure approximate distance at the current access position in the $k$-th graph structure sorting table;
step S63: accessing the current graph sequence in the graph sequence sorting table, and judging whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if it is greater, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if it is less than or equal to that maximum, adding the currently accessed graph sequence to the graph sequence approximation table, and accessing the next graph sequence in the graph sequence sorting table until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table includes k3 graph sequences, and the graph sequences in the graph sequence approximation table are subsequently referred to as approximate graph sequences.
6. A system for retrieving a sequence of medical records based on inverted indexes of subtrees, the system comprising:
the three-layer inverted index construction module is used for constructing three-layer inverted indexes of the medical record sequence database based on a subtree decomposition algorithm; the three-layer inverted index table comprises a graph sequence inverted index table, a graph structure inverted index table and a subtree inverted index table;
an obtaining module, configured to obtain a graph sequence to be queried, where the graph sequence includes a plurality of graph structures, the graph structure is decomposed into a sub-tree sequence, the sub-tree sequence includes a plurality of sub-tree structures, and then each sub-tree structure is decomposed into a node sequence, where the node sequence includes a plurality of nodes;
a giving module, used for giving a size table of the subtree structure $t_q$ to be queried;
the sub-tree approximate table query module is used for obtaining a sub-tree approximate table corresponding to each sub-tree structure by adopting a sub-tree approximate query algorithm based on the sub-tree inverted index table and the size table;
the graph structure approximation table query module is used for obtaining a graph structure approximation table corresponding to each graph structure by adopting a graph structure approximation query algorithm based on the graph structure inverted index table and the subtree approximation table corresponding to each subtree structure;
and the graph sequence approximation table query module is used for acquiring a graph sequence approximation table corresponding to each graph sequence by adopting a graph sequence approximation query algorithm based on the graph sequence inverted index table and the graph structure approximation table corresponding to each graph structure.
7. The system of claim 6, wherein the three-level inverted index construction module comprises:
the map sequence inverted index table construction unit is used for decomposing each medical record map sequence into a medical record map structure sequence, taking each medical record map structure in the medical record map structure sequence as an index, and establishing a map sequence inverted index table corresponding to all the medical record map sequences;
the graph structure inverted index table construction unit is used for decomposing each medical record graph structure into a medical record sub-tree sequence, and establishing a graph structure inverted index table corresponding to all the medical record graph structures by taking each medical record sub-tree structure in the medical record sub-tree sequence as an index;
and the subtree inverted index table construction unit is used for decomposing each medical record subtree structure into a medical record node sequence, and establishing subtree inverted index tables corresponding to all medical record subtree structures by taking each medical record node in the medical record node sequence as an index.
8. The system of claim 6, wherein the sub-tree approximation table query module comprises:
the sub-tree sequencing table determining unit is used for accessing the sub-tree inverted index table to obtain a sub-tree sequencing table corresponding to each node;
an approximate distance calculation unit, used for classifying the subtree sorting tables according to the size table, wherein for each subtree sorting table smaller than the size table, $\alpha$ is calculated using $\alpha = 2 \times |L_q| - (t(\beta) + \tau)$; for each subtree sorting table greater than or equal to the size table, $\alpha$ is calculated using $\alpha = |L_q| - (t(\beta) - 2 \times \ldots)$; wherein $\alpha$ represents the approximate distance between each subtree structure and the subtree structure $t_q$ to be queried, $L_q$ represents the subtree structure $t_q$ to be queried, $t(\beta)$ represents the number of common leaf labels, and $\tau$ represents the last seen size value in the size table;
the first judging unit is used for accessing the current subtree structure in the subtree sorting table and judging whether $\alpha$ is greater than the maximum subtree approximate distance in the subtree approximation table; if $\alpha$ is greater than the maximum subtree approximate distance in the subtree approximation table, stopping subsequent access and outputting the subtree approximation table corresponding to each subtree structure; if $\alpha$ is less than or equal to the maximum subtree approximate distance in the subtree approximation table, adding the currently accessed subtree structure in the subtree sorting table to the subtree approximation table, and accessing the next subtree structure in the subtree sorting table until $\alpha$ is greater than the maximum subtree approximate distance in the subtree approximation table; each subtree approximation table includes k1 subtree structures, and the subtree structures in the subtree approximation table are subsequently referred to as approximate subtrees.
9. The system of claim 8, wherein the graph structure approximation table query module comprises:
the graph structure ordering table determining unit is used for accessing the subtree approximation table row by row and combining and ordering graph structure inverted index tables corresponding to k1 approximation subtrees of the subtree structure to obtain a graph structure ordering table;
a subtree approximate distance evaluation sum determination unit, used for calculating the subtree approximate distance evaluation sum $\gamma$ using $\gamma = \sum_{j=1}^{M} \theta_j$, wherein $M$ represents the total number of subtree sorting tables and $\theta_j$ represents the subtree approximate distance at the current access position in the $j$-th subtree sorting table;
the second judging unit is used for accessing the current graph structure in the graph structure sorting table and judging whether the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; if it is greater, stopping subsequent access and outputting the graph structure approximation table corresponding to each graph structure; if it is less than or equal to that maximum, adding the currently accessed graph structure to the graph structure approximation table, and accessing the next graph structure in the graph structure sorting table until the subtree approximate distance evaluation sum is greater than the maximum graph structure approximate distance in the graph structure approximation table; each graph structure approximation table includes k2 graph structures, and the graph structures in the graph structure approximation table are subsequently referred to as approximate graph structures.
10. The medical record graph sequence retrieval system based on inverted index of subtree as claimed in claim 9, wherein said graph sequence approximation table query module specifically comprises:
the graph sequence sorting table determining unit is used for accessing the graph structure approximation table row by row, and merging and sorting the graph sequence inverted index tables corresponding to the k2 approximate graph structures of each graph structure to obtain a graph sequence sorting table;
a graph structure approximate distance evaluation sum determination unit, used for calculating the graph structure approximate distance evaluation sum $K$ using $K = \sum_{k=1}^{N} \omega_k$, wherein $N$ represents the total number of graph structure sorting tables and $\omega_k$ represents the graph structure approximate distance at the current access position in the $k$-th graph structure sorting table;
a third judging unit, used for accessing the current graph sequence in the graph sequence sorting table and judging whether the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; if it is greater, stopping subsequent access and outputting the graph sequence approximation table corresponding to the graph sequence; if it is less than or equal to that maximum, adding the currently accessed graph sequence to the graph sequence approximation table, and accessing the next graph sequence in the graph sequence sorting table until the graph structure approximate distance evaluation sum is greater than the maximum graph sequence alignment distance in the graph sequence approximation table; each graph sequence approximation table includes k3 graph sequences, and the graph sequences in the graph sequence approximation table are subsequently referred to as approximate graph sequences.
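For orientation, the three-layer inverted index recited in claims 2 and 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the subtree decomposition algorithm is not reproduced, so the input is assumed to be already decomposed, with each graph sequence given as a tuple of graph structures, each graph structure as a tuple of subtrees, and each subtree as a tuple of node labels. The function name `build_three_layer_index`, the record identifiers, and the example labels are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Set, Tuple

Node = str
Subtree = Tuple[Node, ...]         # a subtree as its node sequence
Graph = Tuple[Subtree, ...]        # a graph structure as its subtree sequence
GraphSequence = Tuple[Graph, ...]  # a medical record as a graph sequence


def build_three_layer_index(records: Dict[str, GraphSequence]):
    """Build the three inverted index tables from pre-decomposed records."""
    # Layer 1: graph structure -> ids of the graph sequences containing it.
    sequence_index: DefaultDict[Graph, Set[str]] = defaultdict(set)
    # Layer 2: subtree structure -> graph structures containing it.
    graph_index: DefaultDict[Subtree, Set[Graph]] = defaultdict(set)
    # Layer 3: node -> subtree structures containing it.
    subtree_index: DefaultDict[Node, Set[Subtree]] = defaultdict(set)

    for record_id, sequence in records.items():
        for graph in sequence:
            sequence_index[graph].add(record_id)
            for subtree in graph:
                graph_index[subtree].add(graph)
                for node in subtree:
                    subtree_index[node].add(subtree)
    return sequence_index, graph_index, subtree_index


# Hypothetical example: two records that share one subtree.
records = {
    "r1": ((("fever", "cough"), ("cough", "xray")),),
    "r2": ((("fever", "cough"),),),
}
sequence_index, graph_index, subtree_index = build_three_layer_index(records)
```

Tuples are used so that subtrees and graph structures are hashable and can serve directly as dictionary keys; a production index would more likely assign integer identifiers to each structure.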
CN202110294328.9A 2021-03-19 2021-03-19 Medical record graph sequence retrieval method and system based on sub-tree inverted index Active CN113010746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110294328.9A CN113010746B (en) 2021-03-19 2021-03-19 Medical record graph sequence retrieval method and system based on sub-tree inverted index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110294328.9A CN113010746B (en) 2021-03-19 2021-03-19 Medical record graph sequence retrieval method and system based on sub-tree inverted index

Publications (2)

Publication Number Publication Date
CN113010746A true CN113010746A (en) 2021-06-22
CN113010746B CN113010746B (en) 2023-08-29

Family

ID=76402913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110294328.9A Active CN113010746B (en) 2021-03-19 2021-03-19 Medical record graph sequence retrieval method and system based on sub-tree inverted index

Country Status (1)

Country Link
CN (1) CN113010746B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100106713A1 (en) * 2008-10-28 2010-04-29 Andrea Esuli Method for performing efficient similarity search
US20100125594A1 (en) * 2008-11-14 2010-05-20 The Regents Of The University Of California Method and Apparatus for Improving Performance of Approximate String Queries Using Variable Length High-Quality Grams
CN104182460A (en) * 2014-07-18 2014-12-03 浙江大学 Time sequence similarity query method based on inverted indexes
CN109033314A (en) * 2018-07-18 2018-12-18 哈尔滨工业大学 The Query method in real time and system of extensive knowledge mapping in the case of memory-limited
CN110299209A (en) * 2019-06-25 2019-10-01 北京百度网讯科技有限公司 Similar case history lookup method, device, equipment and readable storage medium storing program for executing
CN111309979A (en) * 2020-02-27 2020-06-19 桂林电子科技大学 RDF Top-k query method based on neighbor vector

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵展浩等: "基于SQL的图相似性查询方法", 软件学报, no. 03, pages 169 - 182 *

Also Published As

Publication number Publication date
CN113010746B (en) 2023-08-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant