WO2014207827A1

WO2014207827A1 - Data analysis device, RDF data expansion method, and data analysis program

Info

Publication number: WO2014207827A1
Application number: PCT/JP2013/067418
Authority: WO
Inventors: Tomohiro Yasuda (安田　知弘)
Original assignee: Hitachi, Ltd. (株式会社日立製作所)
Priority date: 2013-06-25
Filing date: 2013-06-25
Publication date: 2014-12-31
Also published as: JP6001173B2; JPWO2014207827A1

Abstract

During a SPARQL search query, a SPARQL search query variable associated with a frequently compared value is extracted in order to associate data from multiple information sources. A new node created by linking the value associated with this variable is added to the RDF data. By including the newly added data as a search target during the search, the need to acquire individual values is eliminated, and the search process is accelerated.

Description

Data analysis apparatus, RDF data expansion method, and data analysis program

The present invention relates to a search for RDF data by a SPARQL search query, and particularly relates to a search for RDF data derived from a plurality of information sources.

Today, a wide variety of electronic data is being created in every field of society. Finding useful knowledge in such a vast amount of data is an important issue in computer-based data analysis. There are many kinds of data, but data itself is usually just an enumeration of numerical values and character strings, and it becomes usable only when meaning is given to it. The Resource Description Framework (RDF), formulated and recommended by the World Wide Web Consortium (W3C), is a framework devised to express data together with its meaning and its relations to other data (Non-Patent Document 1). RDF expresses things and their relations as sets of three values (hereinafter referred to as triples): thing 1 (hereinafter referred to as S), the kind of relation (hereinafter referred to as P), and thing 2 (hereinafter referred to as O). The O of one triple can be the S of another triple, and the S of one triple can be the O of another. RDF data is therefore represented as a directed graph, that is, a set of points connected by lines that have a direction. In a directed graph, the points are called nodes and the lines are called edges. In the directed graph of RDF, nodes and edges are given Uniform Resource Identifiers (URIs), which serve as identifiers and can identify arbitrary things. The URI of a node represents the thing corresponding to the node, and the URI of an edge represents the relationship between the connected things. A directed graph based on RDF data can be constructed by creating, for each triple, an edge labeled P in the direction from the URI of S to the URI of O.
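The relation between triples and the directed graph can be illustrated with a minimal sketch. The triples and the ex: identifiers below are hypothetical and are not taken from the figures; the sketch only shows that each triple becomes an edge labeled P from S to O, and that the O of one triple can be the S of another.

```python
from collections import defaultdict

# Hypothetical triples (S, P, O); the ex: names are illustrative only.
triples = [
    ("ex:person1", "ex:name", "Alice"),
    ("ex:person1", "ex:worksFor", "ex:company1"),
    ("ex:company1", "ex:name", "Acme"),
]

def build_graph(triples):
    """Map each subject S to its outgoing edges (P, O)."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph

graph = build_graph(triples)

# ex:company1 is the O of one triple and the S of another,
# so the triples chain into a directed graph.
print(graph["ex:person1"])
print(graph["ex:company1"])
```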

Patent Document 1 discloses a technology for comparing resources (S or O) and assigning the same URI to the resources when both resources are determined to be the same.

What is required when handling a large amount of RDF data is a process of searching for information required by the user from a large amount of RDF data and presenting a location that matches the search condition.

As a search query specification for searching RDF data, a standard called SPARQL Protocol and RDF Query Language (SPARQL) has been formulated and recommended by the World Wide Web Consortium (W3C) (Non-patent Document 2). SPARQL describes a partial structure that satisfies a search condition in an RDF graph structure. It is important to speed up a search query described in SPARQL when utilizing RDF data.

JP 2006-302085 A

When searching RDF data, it is sometimes desirable to search RDF data obtained from a plurality of information sources in combination. For example, suppose that the name, date of birth, and address are recorded in the RDF data obtained from information source 1, and the name, date of birth, and occupation are recorded in the RDF data obtained from information source 2. Referring to the RDF data of information sources 1 and 2, regarding records whose name and date of birth match as the same person, and integrating both records to obtain a list of names, addresses, and occupations is an example of searching by combining RDF data obtained from a plurality of information sources. In such processing, as the data of information source 1 and information source 2 grows, the number of places on the RDF graph that must be examined for a match with the query increases, and the processing time increases.

The method described in Patent Document 1 provides a means for searching the data of multiple information sources stored as RDF for items that appear to be identical and for integrating such data. However, the determination of identity is often error-prone, and because the RDF data is reconstructed, the original information sources are affected: different data may be wrongly integrated, or data that should be integrated may be overlooked.

The main problem to be solved by the present invention is to enable a large amount of RDF data obtained from a plurality of information sources to be associated with each other and searched at high speed by a search query described in SPARQL, while preventing any influence on the original information sources.

The present invention includes a plurality of means for solving the above-described problems. A typical example is a data analysis apparatus in which a processor executes SPARQL search queries against RDF data provided from a plurality of information sources. The apparatus comprises: query analysis means for extracting, from a SPARQL search query, the set of variables that match the character strings, numerical values, or dates used to associate a node included in a first information source with a node included in a second information source, as a set called a comparison target variable set; corresponding variable calculation means for selecting, from the SPARQL search query, one variable matching a node included in the first information source and one variable matching a node included in the second information source, each called a corresponding variable; node adding means which, for those SPARQL search queries input to the processor in which the comparison target variable set and the corresponding variables are frequently used and for which the character string, numerical value, or date values matched by the comparison target variable set can be determined in advance, generates as a new node a URI composed by concatenating those values with a separator character string between them, connects the nodes matching the corresponding variables to the URI of the new node, and thereby expands the RDF data; and search means for executing SPARQL search queries against the RDF data. The search means is configured to be able to execute SPARQL search queries that search the expanded RDF data in addition to SPARQL search queries that search the original RDF data.

According to the present invention, a large amount of RDF data obtained from a plurality of information sources can be searched as expanded RDF data whose entries are associated with each other, without affecting the original information sources, so the search can be performed at high speed.

Issues, configurations, and effects other than those described above will be clarified by the following description of the embodiments.

FIG. 1: A diagram showing a configuration example of the data analysis apparatus according to the first embodiment of the present invention.
FIG. 2: A diagram explaining an example of RDF data consisting of a plurality of information sources.
FIG. 3: A diagram showing an outline of the processing by the data analysis apparatus of the first embodiment.
FIG. 4: A diagram showing the data flow of the processing in the data analysis apparatus of the first embodiment.
FIG. 5: A diagram explaining an example of RDF data that consists of a plurality of information sources and contains a larger amount of data than the example of FIG. 2.
FIG. 6: A diagram explaining an example of a SPARQL search query.
FIG. 7: A diagram explaining the subgraph structure that is the search target of the SPARQL search query of FIG. 6.
FIG. 8: A flowchart showing the control logic of the query analysis means of the first embodiment.
FIG. 9: A flowchart showing the control logic of the corresponding variable calculation means of the first embodiment.
FIG. 10: A flowchart showing an outline of the control logic of the node adding means of the first embodiment.
FIG. 11: A flowchart showing the control logic of the simplified query construction process of the node adding means of the first embodiment.
FIG. 12: A flowchart showing the control logic of the process in which the node adding means of the first embodiment adds nodes using the simplified query.
FIG. 13: A diagram explaining an example of the simplified query of the first embodiment.
FIG. 14: A diagram explaining an example of expanded RDF data in which nodes have been added to the RDF data of FIG. 5 by the node adding means.
FIG. 15: A diagram explaining an example of medical data according to the second embodiment of the present invention.
FIG. 16: A diagram explaining an example of a SPARQL query that targets the medical data.
FIG. 17: A diagram explaining an example of the state in which an additional node has been added to the medical data of the second embodiment.
FIG. 18: A flowchart showing the control logic of the means that automatically speeds up an input SPARQL query according to the third embodiment of the present invention.

The present invention is applicable to improving the performance of a system that makes cross-source use of RDF data obtained from a plurality of information sources.
Embodiments of the present invention will be described below with reference to the drawings.

Hereinafter, a first embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram illustrating a configuration example of a data analysis apparatus 100 according to the first embodiment of the present invention. The data analysis apparatus 100 includes a CPU (Central Processing Unit) 101, a main storage device (memory) 102, an auxiliary storage device 103, a removable medium 104, and a user interface unit 106. The data analysis apparatus 100 is connected to an external network via a network 105 such as a LAN (Local Area Network). The main storage device 102 holds the various programs executed by the CPU 101 and the various data the CPU 101 needs to execute these programs. The main storage device 102 is a storage device such as a RAM (Random Access Memory) that stores at least a data analysis program and RDF data 111 (1), which is the input to the data analysis program and the search target. When the CPU 101 executes the data analysis program stored in the main storage device 102, the computer functions as the query analysis means 107, the corresponding variable calculation means 108, the node adding means 109, and the search means 114. The query analysis means 107, the corresponding variable calculation means 108, and the node adding means 109 together constitute an extended node adding unit 110, which converts the original individual RDF data into expanded RDF data by adding extension nodes.

The auxiliary storage device 103 is a storage device such as an HDD capable of recording the RDF data 111 (2) and the like. The removable medium 104 is a recording medium such as a CD-ROM or DVD that can record RDF data 111 (3) and the like. The data recorded in the auxiliary storage device 103 and the removable medium 104 is read into the main storage device 102 as necessary when the data analysis apparatus 100 is started up. Each RDF data 111 includes a plurality of information sources 113.

The user interface unit 106 is an input / output device (for example, a keyboard, a mouse, a display) that provides a user interface.

In the apparatus configuration described above, the CPU 101 acquires the RDF data 111 (4) and the like as needed from the outside via the main storage device 102, the auxiliary storage device 103, the removable medium 104, or the network 105. Thereafter, a node for speeding up the search, which will be described later, is added to the acquired RDF data 111 or a search by SPARQL is performed.

FIG. 2 is a diagram briefly explaining an example of the RDF data 111 that includes two information sources 113a and 113b. In the RDF data 111 obtained from information source 1 (113a), the node whose URI is the identifier ex1:person1 is connected to the nodes in which (Alice), (19800101), and (London) are recorded by edges that indicate the kind of relationship (name, date of birth, address). Similarly, in the RDF data 111 obtained from information source 2 (113b), the node whose URI is the identifier ex2:customer1 is connected, directly or indirectly, to the nodes in which the name, telephone number, date of birth, and occupation are recorded. However, even if information sources 1 and 2 both contain, for example, the same name (Alice), the RDF data at this stage contains no edge indicating whether or not they are the same person.

The extended node adding unit 110 of the present invention performs a process of adding a new node and edge when there is a specific relationship between nodes existing in different RDF data.

FIG. 3 is a diagram showing an overview of the query analysis and node addition processing according to the first embodiment of the present invention, and FIG. 4 is a diagram showing the processing sequence in which the processing means cooperate. FIG. 5 is an example of RDF data including a larger amount of data than the example of FIG. 2.

First, the outline of each process in FIG. 3 will be described.
(1) Input of RDF data and search by search means 114
The RDF data 111 is stored in advance as a plurality of information sources in the main storage device 102, the auxiliary storage device 103, or the like. The user inputs a SPARQL search query (hereinafter referred to as a SPARQL query) 400 to the CPU 101 via the user interface unit 106, and the search means 114 executes this SPARQL query against the RDF data 111 of the plurality of information sources. The search result is held in the main storage device 102 and also output to the user interface unit 106.
(2) Query analysis means 107
The query analysis means 107 analyzes a query described in SPARQL and obtains the set of variables matching the values that are compared in order to obtain corresponding data from a plurality of information sources (hereinafter, information sources 1 and 2). For example, in the RDF examples of FIGS. 2 and 5, the data to be associated are nodes such as ex1:person1 and ex2:customer1, and the values to be compared are the name and date of birth.

The query analysis means 107 receives a SPARQL query 400 as shown in FIG. 6. The SPARQL query 400 searches the RDF data for a partial structure matching the graph shown in FIG. 7. This graph is hereinafter referred to as a query graph. In FIG. 6 and FIG. 7, tokens such as "?target_a" that begin with "?" followed by alphabetic characters and "_" (underscore) are variables, which can match character strings, numerical values, dates, and other nodes of the RDF data. The SPARQL query 400 of FIG. 6 obtains the name and date of birth from the graph of information source 1 on the left of FIGS. 2 and 5, and also obtains the name and date of birth from the graph of information source 2 on the right of FIGS. 2 and 5.

Using the method shown in FIG. 8, the query analysis means 107 determines, for each variable of the SPARQL query 400, whether it is a variable to be compared in order to obtain corresponding data from a plurality of information sources, and generates the set of variables so determined.

The determination procedure will be described with reference to the control logic of FIG. 8. In the following description, the name of the variable to be determined is ?X.
S801: Determine whether ?X matches a character string, numerical value, or date constant. If it matches anything other than such a constant, ?X is not a comparison target variable.
S802: Determine whether ?X can be connected to a node or edge whose URI exists only in one information source by a path that uses only P existing in that information source. If it cannot be connected, ?X is not a comparison target variable.
S803: Determine whether ?X can be connected to a node or edge whose URI exists only in the other information source by a path that uses only P existing in that other information source. If it can be connected, ?X is a comparison target variable; otherwise the determination proceeds to S804.
S804: Let ?Y be a variable other than ?X. If ?X does not appear in a filter condition of the form "filter(?X = ?Y)" or "filter(?Y = ?X)", ?X is not a comparison target variable.
S805: Determine whether the above ?Y can be connected to a node or edge whose URI exists only in the other information source by a path that uses only P existing in that other information source. If it can be connected, ?X is a comparison target variable; if it cannot, ?X is not a comparison target variable.
This will be described concretely using the subgraph structure of FIG. 7. First, consider whether ?name should be a comparison target variable. ?name matches a character string. From ?name, the P (ex1:addr) that exists only in information source 1 can be reached through P existing in information source 1 (ex1:name, ex1:addr, or ex1:date_of_birth). In addition, a P that exists only in information source 2 (for example, ex2:precord) can be reached through P existing in information source 2 (ex2:name, ex2:birthday, ex2:precord, or ex2:workFor). Therefore, ?name is a comparison target variable. Next, consider whether ?birthday1 should be a comparison target variable. ?birthday1 matches a date. Furthermore, ex1:addr, which exists only in information source 1, can be reached through P existing in information source 1. ?birthday1 is compared with ?birthday2 in the filter, and from ?birthday2, ex2:workFor, which exists only in information source 2, can be reached through P existing in information source 2. Therefore, ?birthday1 is also a comparison target variable.
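The determination of S801 to S805 can be sketched as follows. This is a minimal illustration under stated assumptions, not the control logic of FIG. 8 itself: the triple patterns are a hypothetical stand-in for the query of FIG. 6 (the actual figure is not reproduced here), each information source is modeled simply as the set of P it contains, and the set `literal_vars` of variables known to match strings, numbers, or dates is assumed to be given.

```python
from collections import deque

def connects_to_source(var, source, patterns, preds):
    """S802/S803/S805: can `var` reach an edge whose P exists only in
    `source`, walking only triple patterns whose P exists in `source`?"""
    others = set().union(*(p for name, p in preds.items() if name != source))
    exclusive = preds[source] - others
    seen, queue = {var}, deque([var])
    while queue:
        node = queue.popleft()
        for s, p, o in patterns:
            if p not in preds[source] or node not in (s, o):
                continue
            if p in exclusive:
                return True
            for nxt in (s, o):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

def is_comparison_target(x, patterns, preds, literal_vars, filters):
    if x not in literal_vars:                       # S801
        return False
    home = [s for s in preds if connects_to_source(x, s, patterns, preds)]
    if not home:                                    # S802
        return False
    if len(home) > 1:                               # S803
        return True
    for a, b in filters:                            # S804
        if x in (a, b):
            y = b if a == x else a
            if any(connects_to_source(y, s, patterns, preds)
                   for s in preds if s != home[0]):  # S805
                return True
    return False

# Hypothetical query graph loosely modeled on the description of FIG. 7.
patterns = [
    ("?target_a", "ex1:name", "?name"),
    ("?target_a", "ex1:date_of_birth", "?birthday1"),
    ("?target_a", "ex1:addr", "?addr"),
    ("?target_b", "ex2:name", "?name"),
    ("?target_b", "ex2:birthday", "?birthday2"),
    ("?target_b", "ex2:workFor", "?job"),
]
preds = {
    "source1": {"ex1:name", "ex1:date_of_birth", "ex1:addr"},
    "source2": {"ex2:name", "ex2:birthday", "ex2:workFor"},
}
literal_vars = {"?name", "?birthday1", "?birthday2", "?addr", "?job"}
filters = [("?birthday1", "?birthday2")]

targets = {v for v in literal_vars
           if is_comparison_target(v, patterns, preds, literal_vars, filters)}
```

In this sketch ?name qualifies at S803 and ?birthday1 qualifies through the filter at S805, while ?addr and ?job are rejected. Because the sketch applies the steps symmetrically to both information sources, it would accept ?birthday2 as well; the text above only walks through ?name and ?birthday1.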

In this way, it is determined whether or not each variable included in the query 400 is a comparison target when associating data from different information sources. A set of variables determined as comparison targets is hereinafter referred to as a comparison target variable set.
(3) Corresponding variable calculation means 108
The corresponding variable calculation means 108 selects, for each of the two information sources associated by the comparison target variable set, one corresponding variable that matches a node. The user may make this selection via the user interface unit 106, or it may be calculated automatically as follows. First, if there is a variable from which all comparison target variables can be reached, using only the triples of the information source in question, by paths that follow triples in the S → O direction (directed paths), it is preferable to select that variable as the corresponding variable. For example, in FIG. 7, ?target_a and ?target_b are such variables. If there is no such variable, a variable whose paths require as few triples as possible to be followed in the reverse direction (O → S) is selected.

This process will be described in detail using the control logic of FIG. 9.
S901: The variable U is set to the comparison target variable set, and the variable i is set to 1. In the following, a corresponding variable is selected for each information source. i varies from 1 to 2, corresponding to each information source.
S902: Let Vi be the set of variables in the query graph connected to P existing in the i-th information source. For example, in FIG. 7, ?addr, ?name, ?birthday1, and ?target_a, which are connected to ex1:name, ex1:date_of_birth, and ex1:addr of information source 1, are the elements of Vi for i = 1. In the following processing, ?V2 holds the candidate for the corresponding variable. The variable D is set to infinity (∞).
S903: Let ?V be the j-th variable of Vi. j varies from 1 to the size of Vi. The variable d is initialized to 0, and k is set to 1. k varies from 1 to the size of U.
S904: Let ?U be the k-th variable of U.
S905: Find the shortest path from ?V to ?U that passes only through P of information source i. For example, in FIG. 7, the shortest path from ?addr to ?name is ?addr - ?target_a - ?name. This path is hereinafter referred to as p. The shortest path can be identified using, for example, the well-known Dijkstra algorithm (Cormen et al., Introduction to Algorithms, 3rd edition, The MIT Press, pages 658-662).
S906: In p, let e1 be the number of edges traversed in the direction of the directed path ?V → ?U and e2 the number of edges traversed in the opposite direction. In the example of p above, e1 = 1 because ex1:name is the only edge along the direction ?V → ?U, and e2 = 1 because ex1:addr is the only edge in the opposite direction. The score of path p is calculated as e1 + e2 × r, where r is a parameter given by the user. Setting r to ∞ prohibits reverse edges; to allow reverse edges, r is set to a finite value. The score is compared with d, the maximum score of the paths calculated so far, and the larger of the two becomes the new value of d.
S907: Change k and process all comparison target variables.
S908: If D > d, set D ← d and ?V2 ← ?V.
S909: Change j and process all variables of Vi.
S910: When all variables of Vi have been processed, if D is not ∞, ?V2 holds the variable whose maximum path score to the comparison target variables is smallest. This variable is output as the corresponding variable of information source i. If D is ∞, no variable can reach the comparison target variables, so no corresponding variable is output for information source i.
S911: Change i and process both information sources.
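The selection for one information source (S902 to S910) can be sketched as below. Several points are assumptions of this sketch rather than statements of the embodiment: breadth-first search is used in place of Dijkstra's algorithm (it finds a fewest-edge path when every edge counts as length 1), ties between equally short paths are broken arbitrarily, and the targets passed in are limited to the comparison target variables that appear in the query graph of the information source being processed. The edge list is a hypothetical rendering of the information source 1 side of FIG. 7.

```python
from collections import deque
import math

def path_directions(start, goal, edges):
    """Fewest-edge path from start to goal ignoring edge direction.
    Returns one boolean per step (True = followed S -> O), or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for s, o in edges:
            for a, b, forward in ((s, o, True), (o, s, False)):
                if a == node and b not in prev:
                    prev[b] = (node, forward)
                    queue.append(b)
    if goal not in prev:
        return None
    dirs, node = [], goal
    while prev[node] is not None:
        node, forward = prev[node]
        dirs.append(forward)
    return dirs

def score(dirs, r):
    """S906: e1 + e2 * r (guarding against 0 * inf when r is infinite)."""
    e1 = sum(dirs)
    e2 = len(dirs) - e1
    return e1 + (e2 * r if e2 else 0)

def select_corresponding(candidates, targets, edges, r):
    best, D = None, math.inf                     # S902
    for v in candidates:                         # S903, S909
        d = 0
        for u in targets:                        # S904, S907
            dirs = path_directions(v, u, edges)  # S905
            d = max(d, math.inf if dirs is None else score(dirs, r))
        if D > d:                                # S908
            D, best = d, v
    return best                                  # None when D stays infinite (S910)

# Hypothetical edges (S, O) of the information source 1 part of the query graph.
edges1 = [("?target_a", "?name"),
          ("?target_a", "?birthday1"),
          ("?target_a", "?addr")]
v1 = ["?addr", "?name", "?birthday1", "?target_a"]
chosen = select_corresponding(v1, ["?name", "?birthday1"], edges1, r=2)
```

With these inputs ?target_a reaches both targets along forward edges only (maximum score 1), whereas every other candidate needs one reverse edge (score 1 + r), so ?target_a is selected, which agrees with the example given for FIG. 7.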
(4) Node addition means 109
To process a frequently used comparison target variable set efficiently, the node adding means 109 adds nodes for speeding up the search to the two information sources and thereby expands the RDF data. As one way to determine "frequently used", with f a parameter given by the user, a comparison target variable set can be judged to be frequently used when the proportion of queries using it among all SPARQL queries input is f or more.
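The frequency criterion can be sketched in a few lines. The sketch assumes, hypothetically, that each logged SPARQL query has already been reduced to a hashable key identifying its comparison target variable set; how that key is derived is not specified here.

```python
from collections import Counter

def frequent_sets(query_log, f):
    """Return the keys used by a fraction f or more of all logged queries."""
    counts = Counter(query_log)
    total = len(query_log)
    return {key for key, n in counts.items() if n / total >= f}

# Hypothetical log: three queries compare name and birthday, one compares phone.
log = [frozenset({"name", "birthday"})] * 3 + [frozenset({"phone"})]
frequent = frequent_sets(log, f=0.5)
```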

Details of the node adding means will be described with reference to the control logic of FIG. 10.
S1001: First, a comparison target variable set for which the search is to be speeded up by adding nodes is selected.
S1002: Next, a simplified query is created by deleting the conditions unnecessary for adding nodes from the SPARQL query used when calculating the comparison target variable set. The method of creating the simplified query will be described later with reference to FIG. 11.
S1003: Nodes are added using the simplified query. This method will be described later with reference to FIG. 12.

Details of the simplified query creation method (S1002 in FIG. 10) will be described with reference to FIG. 11.
S1101: Let U be the comparison target variable set selected in S1001, V the set of corresponding variables calculated by the corresponding variable calculation means 108, and Q the SPARQL query used to obtain U and V. Hereinafter, the variable i is changed from 1 to 2, and processing is performed for both information sources.
S1102: Let Vi be the set of variables connected on the query graph to P existing in the i-th information source, and let ?V be the corresponding variable of the i-th information source. The variable S is initialized to the empty set.
S1103: Let ?U be the k-th variable of U. k is changed from 1 to the size of U, and processing is performed for all comparison target variables.
S1104: As in S905, the shortest path p connecting ?V and ?U is obtained. Using the same method as S905 yields the same path as S905.
S1105: Add all variables appearing in path p to S.
S1106: Change k and process all comparison target variables.
S1107: Change i and process both information sources.
S1108: Delete from Q the triples and filter conditions that include variables not in S. Furthermore, delete the variables written immediately after select and, in their place, add the variables of the comparison target variable set and the corresponding variables. The SPARQL query thus obtained is output as the simplified query Q′.

FIG. 13 shows an example of the simplified query 500. The simplified query 500 is obtained by erasing unnecessary conditions in the SPARQL query 400 in order to associate a comparison target variable set and a corresponding variable whose search speed is to be increased by extending RDF data.
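The final pruning step (S1108) can be sketched as follows, assuming the set S of variables on the shortest paths has already been collected. The triple patterns, the variable names, and the text of the generated query are hypothetical stand-ins; they are not the contents of FIG. 6 or FIG. 13.

```python
def simplify(patterns, filters, keep_vars, select_vars):
    """S1108: drop triple patterns and filters that mention a variable
    outside keep_vars, and rebuild the SELECT clause."""
    def vars_of(terms):
        return {t for t in terms if t.startswith("?")}

    kept_patterns = [t for t in patterns if vars_of(t) <= keep_vars]
    kept_filters = [f for f in filters if vars_of(f) <= keep_vars]
    lines = ["SELECT " + " ".join(select_vars), "WHERE {"]
    lines += [f"  {s} {p} {o} ." for s, p, o in kept_patterns]
    lines += [f"  FILTER ({a} = {b})" for a, b in kept_filters]
    lines.append("}")
    return "\n".join(lines)

# Hypothetical original query Q.
patterns = [
    ("?target_a", "ex1:name", "?name"),
    ("?target_a", "ex1:date_of_birth", "?birthday1"),
    ("?target_a", "ex1:addr", "?addr"),
    ("?target_b", "ex2:name", "?name"),
    ("?target_b", "ex2:birthday", "?birthday2"),
    ("?target_b", "ex2:workFor", "?job"),
]
filters = [("?birthday1", "?birthday2")]
keep = {"?target_a", "?target_b", "?name", "?birthday1", "?birthday2"}

simplified = simplify(patterns, filters, keep,
                      ["?name", "?birthday1", "?target_a", "?target_b"])
print(simplified)
```

The address and occupation patterns disappear because ?addr and ?job are not on any path between a corresponding variable and a comparison target variable, while the name, birthday, and filter conditions needed to associate the two sources are retained.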

Details of the node addition process using the simplified query (S1003 in FIG. 10) will be described with reference to FIG. 12.
S1201: Let U be the comparison target variable set selected in S1001 and V the set of corresponding variables calculated by the corresponding variable calculation means 108. The search by the simplified query Q′ is executed, and the search results obtained are denoted B. Furthermore, let I be the base URI of the nodes to be added and J the URI of the P to be added; I and J are parameters given by the user. The variable i is set to 1 and the variable s is set to the character string "_". i varies from 1 to the size of B.
S1202: Let b be the i-th search result included in B. The variables j and k are set to 1, and x is set to the empty character string.
S1203: If the variable x is not the empty character string, the character string of the variable s is appended to the right end of x.
S1204: Let ?U be the k-th variable of U. The value associated with ?U in the search result b is appended to the right end of x.
S1205: The variable k is changed from 1 to the size of U, and all comparison target variables are processed.
S1206: A new node is created, and the URI of this node is set to I/x/.
S1207: Let n be the URI of the node assigned to the j-th variable of V in the search result b. The triple "<I/x/> <J> <n> ." is added to the RDF data. However, if this triple already exists in the RDF data, it is not added.
S1208: The variable j is changed from 1 to the size of V, and all corresponding variables are processed.
S1209: The variable i is changed from 1 to the size of B, and all search results are processed.
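Steps S1201 to S1209 can be sketched as below. The search results B are assumed to be available as a list of dictionaries mapping variable names to values; the base URI, the predicate name ex:sameKey, and the result row are hypothetical values chosen for illustration.

```python
def add_nodes(results, key_vars, corr_vars, base_uri, link_pred, triples, sep="_"):
    """Append triples <base_uri/x/> <link_pred> <n> for each search result,
    where x joins the values of key_vars with sep."""
    existing = set(triples)
    for b in results:                                   # S1202, S1209
        x = sep.join(str(b[u]) for u in key_vars)       # S1203 to S1205
        node = f"{base_uri}/{x}/"                       # S1206
        for v in corr_vars:                             # S1207, S1208
            triple = (node, link_pred, b[v])
            if triple not in existing:                  # skip existing triples
                existing.add(triple)
                triples.append(triple)
    return triples

# Hypothetical result of running the simplified query Q'.
B = [{"?name": "Alice", "?birthday1": "19800101",
      "?target_a": "ex1:person1", "?target_b": "ex2:customer1"}]

rdf = []
add_nodes(B, ["?name", "?birthday1"], ["?target_a", "?target_b"],
          "http://example.org/key", "ex:sameKey", rdf)
add_nodes(B, ["?name", "?birthday1"], ["?target_a", "?target_b"],
          "http://example.org/key", "ex:sameKey", rdf)  # second run adds nothing
```

Only new triples are appended, so the triples of the original information sources are left untouched and running the expansion again is harmless.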

FIG. 14 is an example of expanded RDF data in which a new node has been added to the RDF data of FIG. 5. At this time, the comparison target variable set and the corresponding variables are output together with the simplified query and displayed on the user interface unit, so that the user can determine for which nodes of the original RDF data a node has been added. In the example of FIG. 14, as shown by the thick frame, comparing the names and birthdays of information sources 1 and 2 in FIG. 5 determines that ex1:person1 of information source 1 and ex2:customer1 of information source 2, both of which are original data, are the same person Alice, and a new node 1001, namely ex:Alice_19800101, linked to ex1:person1 and ex2:customer1 is added to form the expanded RDF data. Similarly, if information sources 1 and 2 of the original data contain other nodes determined to be the same person, new nodes may be added for them. Expanded RDF data may also be formed across more than two information sources.

In the RDF data of FIG. 5, a search must compare the character strings and dates of information sources 1 and 2 individually, and identifying all identical persons takes time. The larger the RDF data, the more serious this problem becomes. Using the RDF data expanded by the node adding means 109 for the search speeds up SPARQL queries that include the comparison target variable set for which the expansion was made. Because the newly added RDF data is included as a search target, individual values no longer need to be acquired as in the conventional approach, and the search processing is accelerated.

Specifically, when the user inputs a search query to the search means 114, instead of describing the comparison target variable set in the SPARQL query, the generated triples "<I/x/> <J> <n> .", that is, the added new node 1001, may be specified as the search condition. This speeds up the search processing by the processor.

In addition, adding a new node does not require integrating the RDF data of the original information sources 1 and 2. In other words, the RDF data of the original information sources 1 and 2 remains as it is, without any influence. Unlike data integration, adding an extension node does not affect searches of information source 1 and information source 2 by conventional programs at all. Since only new nodes and edges are added, the search means 114 can still run a conventional program with a conventional SPARQL query under conditions that do not involve the new nodes. For example, the conventional program itself need not be changed; it suffices to add, as preprocessing for the program, a function for adding new nodes to form the expanded RDF data of this embodiment or a function for searching via the new nodes.

Since the original RDF data of information sources 1 and 2 can be used as it is, if a case arises in which, as in the example of FIG. 14, the name and birthday coincide but the persons linked by the extension node (Alice) are in fact not the same person, the search can be performed again using the original RDF data of information sources 1 and 2, and an appropriate extension node can be generated from more accurate information.

As described above, according to the present embodiment, a large amount of RDF data given from a plurality of information sources is searched as expanded RDF data whose entries are associated with each other, without affecting the original information sources, so faster searching becomes possible.

Next, a second embodiment in which the present invention is applied to medical data will be described with reference to the drawings.
Current medical practice must handle data originating from multiple information sources, such as electronic medical record data, test values, medical image data and the additional information (metadata) attached to them, and the medical accounting data needed to claim medical fees. By handling such data as RDF, it is expected that RDF data processing technology can be used for medical data analysis. In that case, the data analysis method using SPARQL queries of the present invention can be applied as a means for speeding up search processing on medical data converted to RDF.

An example of the data to be processed is shown in FIG. 15. In this example, medical accounting data converted to RDF and electronic medical record data converted to RDF are used together across sources. Medical accounting data can be obtained, for example, from the receipt information data that medical institutions submit to the Ministry of Health, Labour and Welfare (refer to the 2013 "Survey on Impact Assessment of DPC Introduction"). The information source 1 (113a) on the left side of FIG. 15 is an example of the RDF graph 111 derived from medical accounting data, where account:ID is the patient ID, account:admission_date is the date of hospitalization, and account:point is the number of points used to calculate medical fees. The information source 2 (113b) on the right side of FIG. 15 is an RDF graph 111 derived from an electronic medical record, where echart:ID is the patient ID, echart:date_admission is the date of hospitalization, and echart:diagnosis is the diagnosis.
Here, consider the process of obtaining the points in the medical accounting data (113a) for cases diagnosed as myocardial infarction in the electronic medical record data (113b). Simply collating the patient IDs of the medical accounting data and the electronic medical record is not sufficient, because some patients have been admitted and discharged multiple times. The hospitalization date must be collated at the same time. SPARQL queries that include both the patient ID and the hospitalization date as search conditions, such as the SPARQL query 450 in FIG. 16, therefore need to be processed frequently. For each of these queries, a process that first searches for a matching patient ID and then searches for a matching hospitalization date must consider many useless partial matches for patients who are repeatedly admitted and discharged. Nevertheless, whenever medical accounting data and electronic medical records are used together, matching on patient ID and hospitalization date is frequently required.

FIG. 17 is a diagram illustrating an example of a state in which an additional node is added to medical data according to the second embodiment of this invention. Using the method described in the first embodiment of the present invention, a node 1201 that combines the patient ID and the hospitalization date is generated and added to the RDF graph to form the expanded RDF data. By collating the cases of the electronic medical record (113b) with the records of the medical accounting data (113a) via this node 1201, the processing for searching across both (113a, 113b) can be accelerated. For example, a case collated through the new node :135791_20240608 is one in which the patient ID (135791) and one of the patient's multiple hospitalization dates (20240608) match between the medical accounting data and the electronic medical record. Using this result, the medical accounting points of a case diagnosed as myocardial infarction in the electronic medical record can be obtained easily. Expanded RDF data may also be formed with additional kinds of medical data.
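The effect of the added node 1201 can be illustrated with a minimal Python sketch. The record layout, the point values, the second hospitalization, the separator, and the helper names `composite_key` and `join_via_node` are assumptions made for illustration only; the embodiment itself specifies only that the patient ID and the hospitalization date are combined into one node.

```python
from collections import defaultdict

SEPARATOR = "_"  # assumed "predetermined character string" joining the values


def composite_key(patient_id: str, admission_date: str) -> str:
    """Build the label of the added node from patient ID and admission date."""
    return f"{patient_id}{SEPARATOR}{admission_date}"


def join_via_node(accounting, echart):
    """Pair accounting records and medical-record cases sharing an added node."""
    by_node = defaultdict(list)
    for rec in accounting:
        by_node[composite_key(rec["ID"], rec["admission_date"])].append(rec)
    pairs = []
    for case in echart:
        node = composite_key(case["ID"], case["date_admission"])
        # A single lookup replaces "match the ID, then filter by date".
        for rec in by_node.get(node, []):
            pairs.append((node, case["diagnosis"], rec["point"]))
    return pairs


# Hypothetical sample data: one patient hospitalized twice.
accounting = [
    {"ID": "135791", "admission_date": "20240608", "point": 5200},
    {"ID": "135791", "admission_date": "20240102", "point": 1800},
]
echart = [
    {"ID": "135791", "date_admission": "20240608",
     "diagnosis": "myocardial infarction"},
    {"ID": "135791", "date_admission": "20240102", "diagnosis": "pneumonia"},
]

# Points of the cases diagnosed as myocardial infarction.
mi_points = [point for _, diagnosis, point in join_via_node(accounting, echart)
             if diagnosis == "myocardial infarction"]
```

Because both hospitalizations share the same patient ID, matching on the ID alone would pair each case with both accounting records; the combined key pairs each case with exactly one.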

As described above, according to this embodiment, a large amount of RDF data given from a plurality of medical data information sources can be searched as expanded RDF data in which the sources are associated with each other, without affecting any of the information sources. The user can therefore search medical data quickly according to the application.

Next, as a third embodiment of the present invention, an example of speeding up by automatic rewriting of a search query will be described.
In the first embodiment, the user must rewrite the SPARQL query input through the search unit 114 in order to speed up the search process using the RDF data expanded by the node addition unit 109. However, such a method places a burden on the user, and the possibility of an error during rewriting cannot be ignored. Therefore, the present embodiment provides a means for automatically rewriting the SPARQL search query to increase the speed.

To realize this, in the third embodiment, a program for causing a computer to function as SPARQL automatic rewriting means is stored in the main storage device 102 of the data analysis apparatus 100 described with reference to FIG. 1 for the first embodiment. In the SPARQL query input by the user using the search unit 114, the conditions that match the simplified query are automatically rewritten into conditions on the newly added node. Other configurations are the same as those of the first embodiment.

First, based on the SPARQL query given by the user, a query graph expressed by the SPARQL query is constructed. As described in the first embodiment, FIG. 7 is an example of a query graph generated from the SPARQL query 400 of FIG.

Next, the simplified query 500 used by the node addition unit 109, as shown in the example of FIG. 13, is matched against this query graph. If the simplified query matches, the SPARQL automatic rewriting means deletes the matched part from the search condition input by the user and automatically adds to it a condition on the node added by the node addition unit 109. This makes a search using the expanded RDF data possible.

Details of this processing will be described with reference to FIG.
S1801: Initialize the variables: U ← the comparison target variable set, V ← the set of corresponding variables, Q′ ← the simplified query, q ← the input search query, and i ← 1.
S1802: Construct a query graph for q. This query graph is called g below.
S1803: If Q′ does not match g, the search is executed with q unmodified.
S1804: Copy the comparison target variable set U to the variable S. Thereafter, S1805 to S1807 are repeated until S becomes an empty set.
S1805: Extract one variable from S. Let that variable be ?x. Remove ?x from S. If an edge in g that does not match Q′ is connected to ?x, ?x is not processed further and the process proceeds to the next variable.
S1806: If ?x is described immediately after select, ?x is not processed further and the process proceeds to the next variable. This is because a variable described immediately after select is necessary for the output of the SPARQL query q and cannot be replaced.
S1807: In g, add all variables directly connected to ?x by an edge to S. Furthermore, delete the triples that include ?x from q.
S1808: If a variable of a filter condition of q does not appear in the triples of the rewritten q, delete that filter condition.
S1809: For each of i = 1 and 2, where ?v is the i-th corresponding variable of V, add the triple “?ident <J> ?v .” to the query.
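Steps S1801 to S1809 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: a query is modeled as a list of (subject, predicate, object) tuples, each filter as a pair of its variables and its text, and the match of Q′ against g is reduced to checking that every triple of Q′ occurs in q with identical variable names (a real matcher would have to allow renamed variables). The `seen` set is an added guard so that the loop terminates, and the function name `rewrite` and the sample query are hypothetical.

```python
def is_var(term):
    """SPARQL variables are written with a leading question mark."""
    return term.startswith("?")


def rewrite(q, filters, select, U, V, Qp, link="<J>"):
    """Rewrite query triples q so that they use the added node."""
    g = list(q)                               # S1802: query graph as edge list
    if not all(t in g for t in Qp):           # S1803: Q' does not match g
        return q, filters
    S, seen = list(U), set()                  # S1804 (seen: assumed guard)
    while S:
        x = S.pop()                           # S1805: take one variable
        if x in seen:
            continue
        seen.add(x)
        edges = [t for t in g if x in (t[0], t[2])]
        if any(t not in Qp for t in edges):   # an edge outside Q' touches x
            continue
        if x in select:                       # S1806: needed for the output
            continue
        for s, _, o in edges:                 # S1807: enqueue neighbours
            S.extend(n for n in (s, o) if is_var(n) and n != x)
        q = [t for t in q if x not in (t[0], t[2])]
    used = {n for s, _, o in q for n in (s, o) if is_var(n)}
    filters = [f for f in filters if set(f[0]) <= used]   # S1808
    q = q + [("?ident", link, v) for v in V]              # S1809
    return q, filters


# Hypothetical input modeled on the medical example of the second embodiment.
Qp = [("?a", "account:ID", "?id1"), ("?a", "account:admission_date", "?d1"),
      ("?e", "echart:ID", "?id2"), ("?e", "echart:date_admission", "?d2")]
q = Qp + [("?a", "account:point", "?pt"), ("?e", "echart:diagnosis", "?dx")]
filters = [(("?id1", "?id2"), "?id1 = ?id2"), (("?d1", "?d2"), "?d1 = ?d2"),
           (("?dx",), '?dx = "myocardial infarction"')]

new_q, new_filters = rewrite(q, filters, select=["?pt"],
                             U=["?id1", "?d1", "?id2", "?d2"],
                             V=["?a", "?e"], Qp=Qp)
```

In this example the four ID and date triples and their two equality filters are removed, and the two corresponding variables ?a and ?e are instead tied together through the shared ?ident node.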

The automatic rewriting of SPARQL so that the expanded RDF data becomes the search target is performed automatically by the CPU 101 when the user inputs a SPARQL query through the search unit 110 in FIG. 4, and the search of the RDF data 111 in the main storage device 102 is executed based on the result. Therefore, the user can search the expanded RDF data at high speed using the original SPARQL query as it is, that is, without rewriting it into a SPARQL query that targets the expanded RDF data.

Although embodiments of the present invention have been described above, these embodiments merely show examples of application of the present invention and are not intended to limit the technical scope of the present invention to the specific configurations of the embodiments. Various modifications can be made without departing from the scope of the present invention.

100 Data analysis device
101 CPU (Central Processing Unit)
102 Main storage device
103 Auxiliary storage device
104 Removable media
105 Network
106 Interface unit
107 Query analysis unit
108 Corresponding variable calculation unit
109 Node addition unit
111 RDF data
400 Example of SPARQL query
450 Example of SPARQL query for medical data
500 Example of simplified query
1001 Example of node added by node addition means
1201 Example of node added by node addition means to medical data

Claims

1. A data analysis device that executes SPARQL search queries on RDF data given from a plurality of information sources, comprising:
query analysis means for extracting, from the SPARQL search query, as a set called a comparison target variable set, a set of variables that match a character string, a numerical value, or a date for associating a node included in a first information source with a node included in a second information source;
corresponding variable calculation means for selecting, from the SPARQL search query, as variables called corresponding variables, the variables that respectively match the node included in the first information source and the node included in the second information source;
node adding means for analyzing the SPARQL search queries input to the processor, calculating a frequently used combination of the comparison target variable set and the corresponding variables, generating a new node whose URI is formed by joining the character string, numerical value, and date values to be matched by the comparison target variable set with a predetermined character string interposed between them, and extending the RDF data by connecting the nodes matching the corresponding variables to the URI of the new node; and
search means for executing a SPARQL search query on the RDF data,
wherein the search means is configured to be able to execute, in addition to a SPARQL search query for searching the original RDF data, a SPARQL search query that uses the expanded RDF data as a search target.
2. The data analysis device according to claim 1,
wherein the node adding means
selects the comparison target variable set for which the search is to be sped up by the extension, and creates a simplified query by deleting conditions unnecessary for node addition from the SPARQL query used when calculating the comparison target variable set.
3. The data analysis device according to claim 1,
further comprising SPARQL automatic rewriting means,
wherein the SPARQL automatic rewriting means,
even when a SPARQL search query for searching the original RDF data is input, automatically replaces the condition relating to the comparison target variable set with a condition on the URI of the new node added by the node adding means.
4. The data analysis device according to claim 1,
wherein the corresponding variable calculation means
selects, from each of the first information source and the second information source associated by the comparison target variable set, as the variable corresponding to the node to be associated by the comparison target variable set, a variable from which all the variables in the comparison target variable set are reachable by directed paths.
5. The data analysis apparatus according to claim 4,
wherein the corresponding variable calculation means,
if there is no node from which all the variables of the comparison target variable set can be reached by directed paths, selects the node, if any, for which the number of nodes on the path to each variable of the comparison target variable set is the minimum.
6. The data analysis apparatus according to claim 4,
further comprising a user interface unit that provides a user interface,
wherein the corresponding variable calculation means
provides, on the user interface unit, an interface with which a user selects the variables to be associated.
7. The data analysis apparatus according to claim 2,
further comprising a user interface unit that provides a user interface,
wherein the node adding means,
when the expanded RDF data to which the new node is added is generated, outputs the comparison target variable set and the variables representing the corresponding nodes together with the simplified query, and displays on the user interface unit, in a form that the user can discern, for what kind of nodes the new node has been added.
8. The data analysis device according to claim 1,
wherein the RDF data is medical data derived from a plurality of information sources.
9. The data analysis apparatus according to claim 8,
wherein the RDF data includes medical accounting data converted into RDF and electronic medical record data converted into RDF,
the node adding means adds, as the new node, a node that combines the patient ID and the hospitalization date to generate the expanded RDF data, and
the search means performs, via the new node, a search across the RDF-converted medical accounting data and the RDF-converted electronic medical record data.
10. The data analysis device according to claim 1,
wherein the node adding means,
as the condition for determining “frequently used”, with f being a parameter given by the user, determines a comparison target variable set to be “frequently used” when the ratio of the input SPARQL queries in which that comparison target variable set appears is greater than or equal to f.
11. A method for extending, by a data analysis apparatus, RDF data provided from a plurality of information sources,
the data analysis apparatus including a processor and a memory and executing SPARQL search queries on the RDF data provided from the plurality of information sources, the method comprising:
a step of extracting, from the SPARQL search query, as a set called a comparison target variable set, a set of variables that match a character string, a numerical value, or a date for associating a node included in a first information source with a node included in a second information source;
a step of selecting, from the SPARQL search query, as variables called corresponding variables, the variables that respectively match the node included in the first information source and the node included in the second information source; and
a node addition step of analyzing the SPARQL search queries input to the processor, calculating a frequently used combination of the comparison target variable set and the corresponding variables, generating, as a new node, a URI formed by joining the character string, numerical value, and date values to be matched by the comparison target variable set with a predetermined character string interposed between them, and extending the RDF data by connecting the nodes matching the corresponding variables to the URI of the new node.
12. The method of extending RDF data according to claim 11,
wherein, in the node addition step,
the comparison target variable set for which the search is to be sped up by the extension is selected, a simplified query is created by deleting conditions unnecessary for node addition from the SPARQL query used when calculating the comparison target variable set, and
what kind of nodes the new node has been added for is displayed on a user interface unit.
13. The method of extending RDF data according to claim 11,
further comprising a SPARQL automatic rewriting step of automatically replacing, even when a SPARQL search query for searching the original RDF data is input to the data analysis apparatus, the condition relating to the comparison target variable set with a condition on the URI of the added new node.
14. A data analysis program for executing SPARQL search queries on RDF data provided from a plurality of information sources,
the program causing a computer to execute:
a procedure of extracting, from the SPARQL search query, as a set called a comparison target variable set, a set of variables that match a character string, a numerical value, or a date for associating a node included in a first information source with a node included in a second information source;
a procedure of selecting, from the SPARQL search query, as variables called corresponding variables, the variables that respectively match the node included in the first information source and the node included in the second information source; and
a procedure of analyzing the SPARQL search queries input to the processor, calculating a frequently used combination of the comparison target variable set and the corresponding variables, generating a URI formed by joining the character string, numerical value, and date values to be matched by the comparison target variable set with a predetermined character string interposed between them, and extending the RDF data by connecting the URI to the nodes matching the corresponding variables.
15. The data analysis program according to claim 14,
further causing the computer to execute a SPARQL automatic rewriting procedure of automatically replacing, even when a SPARQL search query for searching the original RDF data is input to the computer, the condition relating to the comparison target variable set with a condition on the URI of the added new node.