WO2022259303A1

WO2022259303A1 - Name data association device, name data association method, and name data association program

Info

Publication number: WO2022259303A1
Application number: PCT/JP2021/021548
Authority: WO
Inventors: まな美小川; 正崇佐藤
Original assignee: 日本電信電話株式会社
Priority date: 2021-06-07
Filing date: 2021-06-07
Publication date: 2022-12-15
Also published as: JPWO2022259303A1

Abstract

A name data association device according to one embodiment includes: a common data extraction unit that extracts, as common data, name data with the same representation between a first database (DB) and a second DB, the first DB holding a plurality of pieces of name data and adjacent information indicating adjacent relationships between the name data, the second DB holding the plurality of pieces of name data, adjacent information between the name data, and path identification information representing paths to which these name data belong; a path generation unit that extracts partial paths from a path represented by the path identification information held by the second DB, the partial paths having the common data as endpoints and having non-common data as a vertex between the endpoints, and generates, on the basis of the information held by the first DB, for each of the partial paths, a path having the same common data endpoints as the endpoints of the partial paths and a length equal to or larger than that of the partial paths; and an association unit that searches a combination of the vertexes on the partial paths and a vertex on the path for each of the partial paths to thereby associate the name data between the first DB and the second DB.

Description

Name data association device, name data association method, and name data association program

Embodiments of the present invention relate to a name data association device, a name data association method, and a name data association program.

In work that uses databases, databases under different management are sometimes integrated so that the stored name data can be used side by side for more diversified and comprehensive analysis. For this purpose, the name data must be integrated by giving the same identification information to name data that represents the same matter in the databases to be integrated; this is called "name matching".

However, the way name data is entered depends on the database manager. As a result, the databases to be integrated often contain notation variations (spelling variations), in which the same thing is written differently. If databases containing spelling variations are integrated, then in the above-mentioned analysis the information related to a given matter will be incomplete for exactly the parts where the spelling variations occur.

As techniques for coping with such spelling variations, Non-Patent Document 1 and Non-Patent Document 2 propose methods of searching for the most similar character string by quantitatively calculating the degree of similarity between the character strings to be searched. Non-Patent Document 3 proposes a method of accurately and efficiently searching for character strings representing the same matter by creating a search dictionary. Further, Patent Document 1 discloses a linking method that uses peripheral information of the data for which name matching is desired.

Japanese Patent Application Laid-Open No. 2020-123210

There are two types of notation variation: abbreviated notation, which shortens the registered data name, and common name notation, which uses a name (common name) based on local rules among users.

The methods disclosed in Non-Patent Documents 1 and 2 are popular and effective when only the former, abbreviated notation exists as a notation variation. However, in the latter situation, where common name notations are mixed in, each common name is associated with the name whose character string is most similar to it, so there is a high possibility of presenting an erroneous result. This is because, in many cases, a common name differs significantly from the name to which it should be linked.

Also, even if only the former, abbreviated notation is handled, the methods disclosed in Non-Patent Documents 1 and 2 were created on the assumption that they will be used for Japanese, so the scope of application of the technology is limited. Since the characteristics of abbreviations in Japanese do not all match those of other languages, the methods disclosed in Non-Patent Documents 1 and 2 cannot necessarily be applied without problems to name data entered in other languages.

Therefore, it is considered that creating a dictionary, such as that disclosed in Non-Patent Document 3, is the optimal method for common name notation. However, as the number of databases to be integrated increases, the dictionary needs to be expanded accordingly, so there is a drawback that it takes time to cope with spelling variations.

Therefore, Patent Document 1 proposes a technique that makes it possible to identify common names without relying on a dictionary, by using information around the data to be identified (for example, that data A and data B in the same database are related). However, the technology disclosed in Patent Document 1 requires that the graphs constructed from the name data of each database satisfy a kind of inclusion relationship (an edge corresponding to each edge of one graph always exists in the other graph). For this reason, there is a problem that name data yielding graphs in which the inclusion relationship does not hold is difficult to match or, even where matching is possible, a large number of candidate names are generated.

This invention seeks to provide a technology that can accurately associate synonymous name data that has notational variations between databases to be integrated without requiring human intervention.

To solve the above problems, a name data association device according to an aspect of the present invention associates name data held by a first database, which holds a plurality of name data and adjacency information indicating a logical or physical adjacency relationship between the name data, with name data held by a second database, which holds a plurality of name data, adjacency information of the name data, and path identification information representing the paths to which the name data belong. The device comprises a common data extraction unit, a path creation unit, and an association unit. The common data extraction unit extracts name data having the same notation between the first database and the second database as common data. The path creation unit extracts, from the path represented by the path identification information held by the second database, partial paths that have the common data extracted by the common data extraction unit as endpoints and non-common data as vertices between the endpoints, and creates, based on the information held by the first database, for each partial path, a path whose common-data endpoints are identical to the endpoints of the partial path and whose length is equal to or greater than the length of the partial path. For each partial path extracted by the path creation unit, the association unit searches for combinations of vertices on the partial path and vertices on the path created by the path creation unit, thereby associating the name data held by the first database with the name data held by the second database.

According to one aspect of the present invention, it is possible to provide a technology capable of accurately associating synonymous name data with spelling variations between databases to be integrated without requiring human operations.

FIG. 1 is a block diagram showing an example of the configuration of a name data association device according to one embodiment of the present invention.
FIG. 2 is a diagram showing an example of the hardware configuration of the name data association device.
FIG. 3 is a diagram showing an example of information held by a basic database stored in a basic database storage unit.
FIG. 4 is a diagram showing an example of information held by a derived database stored in a derived database storage unit.
FIG. 5 is a flow chart showing an example of processing operations related to association of name data in the name data association device.
FIG. 6 is a schematic diagram for explaining a method of associating names.
FIG. 7 is a diagram showing an example of information held by a basic database in an operation example.
FIG. 8 is a diagram showing an example of information held by the derivative database in the operation example.
FIG. 9 is a schematic diagram showing an example of a cycle graph created from information held in a basic database by the graph creating unit in the operation example.
FIG. 10 is a schematic diagram showing an example of a path generated from a cycle graph created from information held in a derived database by the graph creating unit in the operation example.
FIG. 11 is a diagram illustrating an example of output information stored in an output information storage unit in an operation example.

Hereinafter, embodiments according to the present invention will be described with reference to the drawings.

In this embodiment, it is assumed that multiple databases hold synonymous name data with different notations, and that the data columns whose name data are to be associated in these databases are already known. Each data column can include name data and string-specific data corresponding to the name data, such as a measurement value, date and time of measurement, date and time of sale, and amount of sales. It is also assumed that each database holds logical or physical adjacency information indicating the adjacency relationship of the name data. Here, the adjacency information indicating the adjacency relationship of the name data refers to information on how the data are connected to each other, for example personal connections (person A and person B are acquaintances) or connection relationships on a network (building A and building B are connected by a cable). Although the present invention does not particularly limit the number of databases whose name data are associated, in this embodiment, for simplicity of explanation, it is assumed that there are two target databases. It is also assumed that there is a connection relationship on the network between the name data in each database. Specifically, each database has columns named "upper building" and "lower building", which indicate that the name data stored in "upper building" and the name data stored in "lower building" are adjacent. In addition, it is assumed that at least one of the plurality of databases holds, in addition to the adjacency information, path identification information indicating the path to which the name data belongs.

(Configuration example)
FIG. 1 is a block diagram showing an example of the configuration of a name data association device according to one embodiment of the present invention. The name data association device includes a basic database (in the figure, database is abbreviated as DB) 1, a derivative database 2, a graph creation unit 3, a common data extraction unit 4, a path information extraction unit 5, a path creation unit 6, an association unit 7, and a data output unit 8.

The basic database 1 is a first database that holds a plurality of name data and adjacency information indicating the adjacency relationship between the name data. The derivative database 2 is a second database that holds a plurality of name data, adjacency information of the name data, and path identification information representing paths to which the name data belong.

Based on the information held by the basic database 1 and the derived database 2, the graph creation unit 3 creates an undirected graph with name data as vertices.
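As a non-limiting illustration, the processing of the graph creation unit 3 can be sketched in Python as follows. The sketch assumes that each database row has been reduced to an ("upper building", "lower building") pair; the function name `build_graph` and the sample rows are hypothetical and are not part of the embodiment.

```python
from collections import defaultdict


def build_graph(rows):
    """Build an undirected graph {name: set of adjacent names} from (upper, lower) pairs."""
    graph = defaultdict(set)
    for upper, lower in rows:
        graph[upper].add(lower)
        graph[lower].add(upper)  # undirected: store the edge in both directions
    return dict(graph)


# Hypothetical rows; the names are placeholders, not data taken from the figures.
rows = [("A", "B"), ("B", "C"), ("C", "A")]
graph = build_graph(rows)
```

An adjacency-set representation is used here only because it makes neighbor lookup simple; any undirected graph structure would serve.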

The common data extraction unit 4 extracts name data that is written in the same way between the basic database 1 and the derivative database 2 as common data.
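Extraction of common data amounts to taking the names whose notation is identical in both databases. A minimal sketch, assuming the names of each database are available as plain lists of strings (the function name `extract_common` and the sample names are hypothetical):

```python
def extract_common(base_names, derived_names):
    """Return the names written identically in both databases (exact string match)."""
    return set(base_names) & set(derived_names)


# Hypothetical name lists used only for illustration.
base_names = ["Shinjuku Building", "Gaien Building", "Yotsuya Building"]
derived_names = ["Shinjuku Building", "Gaien Bldg", "Yotsuya"]
common = extract_common(base_names, derived_names)
```

In this sample only "Shinjuku Building" is written identically in both lists, so it alone becomes common data; the other names are left for the path-based association.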

Based on the path identification information held by the derivative database 2, the path information extraction unit 5 generates all paths that start from one of the common data extracted by the common data extraction unit 4 and have the name data held by the derivative database 2 as vertices. The end point of a path may be the same common data as the start point or different common data. The path information extraction unit 5 then extracts, for each of those paths, path information including the number of vertices, the name data of the included vertices, and their positions on the path. For example, the path information extraction unit 5 can extract the path information based on the undirected graph created by the graph creation unit 3 and the path identification information held by the derivative database 2.

For each path indicated by the path information extracted by the path information extraction unit 5, the path creation unit 6 extracts partial paths whose endpoints, that is, start point and end point, are common data. Then, based on the information held by the basic database 1, the path creation unit 6 enumerates all paths that have the same common-data endpoints as each partial path and that have a prescribed length. For example, the path creation unit 6 can enumerate the paths based on the undirected graph created by the graph creation unit 3 from the basic database 1.

The association unit 7 searches for combinations of vertex name data from the partial paths extracted by the path creation unit 6 and the enumerated paths, for example, based on character string similarity such as the edit distance. Then, the association unit 7 associates the name data held by the basic database 1 with the name data held by the derivative database 2 based on the combinations found.

The data output unit 8 generates output information based on the result of association by the association unit 7 and outputs it. For example, the data output unit 8 can generate, as output information, a correspondence table representing the correspondence of name data based on the result of association by the association unit 7. The data output unit 8 may also convert the name data of the information held in the basic database 1 based on the result of association by the association unit 7, create a new database, and use this as the output information. Alternatively, the data output unit 8 may integrate the information held by the basic database 1 and the derivative database 2 based on the result of association by the association unit 7, create a new database, and use this as the output information.

FIG. 2 is a diagram showing an example of the hardware configuration of the name data association device.

As shown in FIG. 2, the name data association device is composed of a computer such as a server computer or a personal computer, and has a hardware processor 101 such as a CPU (Central Processing Unit). In the name data association device, a program memory 102, a data memory 103, a communication interface 104, and an input/output interface (denoted as input/output IF in FIG. 2) 105 are connected to the processor 101 through a bus.

The communication interface 104 can include, for example, one or more wired or wireless communication modules. If the basic database 1 and/or the derivative database 2 are configured in a data server or the like connected via a network such as a LAN (Local Area Network) or the Internet, the communication interface 104 can communicate with the data server or the like and acquire data from it. The communication interface 104 can also communicate with an external data processing device or the like to receive a request from the data processing device, and can return the data processing result corresponding to the request to the data processing device.

An input unit 107 and a display unit 108 are connected to the input/output interface 105. As the input unit 107 and the display unit 108, a so-called tablet-type input/display device can be used, in which an input detection sheet adopting an electrostatic method or a pressure method is arranged on the display screen of a display device using, for example, liquid crystal or organic EL (Electro Luminescence). Note that the input unit 107 and the display unit 108 may be configured as independent devices. The input/output interface 105 inputs operation information entered from the input unit 107 to the processor 101 and displays display information generated by the processor 101 on the display unit 108.

Note that the input unit 107 and the display unit 108 do not have to be connected to the input/output interface 105 . The input unit 107 and the display unit 108 are provided with a communication unit for connecting to the communication interface 104 directly or via a network, so that information can be exchanged with the processor 101 .

The input/output interface 105 may have a read/write function for a recording medium such as a semiconductor memory, for example a flash memory, or may have a function for connecting to a reader/writer that has a read/write function for such a recording medium. As a result, a recording medium detachable from the name data association device can be used as a database for holding name data. The input/output interface 105 may further have a function for connecting to other devices.

The program memory 102 is a non-transitory tangible computer-readable storage medium that combines, for example, a non-volatile memory that can be written and read at any time, such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), with a non-volatile memory such as a ROM. The program memory 102 stores the programs necessary for the processor 101 to execute various control processes according to one embodiment. That is, the processing function units of the above-described graph creation unit 3, common data extraction unit 4, path information extraction unit 5, path creation unit 6, association unit 7, and data output unit 8 can all be realized by causing the processor 101 to read and execute the programs stored in the program memory 102. Some or all of these processing function units may also be implemented in various other forms, including integrated circuits such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs).

The data memory 103 is a tangible computer-readable storage medium, for example, a combination of the above nonvolatile memory and a volatile memory such as RAM (Random Access Memory). This data memory 103 is used to store various data acquired and created in the process of performing various processes. That is, in the data memory 103, an area for storing various data is appropriately secured in the process of performing various processes. As such areas, the data memory 103 can be provided with, for example, a basic database storage unit 1031 , a derived database storage unit 1032 , a temporary storage unit 1033 and an output information storage unit 1034 .

The basic database storage unit 1031 stores information of the basic database 1, and the derived database storage unit 1032 stores information of the derived database 2. That is, the basic database 1 and the derivative database 2 can be configured in the basic database storage unit 1031 and the derivative database storage unit 1032 .

FIG. 3 is a diagram showing an example of information held by the basic database 1 stored in the basic database storage unit 1031, and FIG. 4 is a diagram showing an example of information held by the derived database 2 stored in the derived database storage unit 1032. Here, an example is shown in which the name data is the name of a building. In the basic database 1 stored in the basic database storage unit 1031, the upper building and the lower building of each row are adjacent to each other. In the derived database 2 stored in the derived database storage unit 1032, the combination of buildings having the same path identifier (identifier is abbreviated as ID in the figure) represents one path (Shinjuku Building → Minami-Shinjuku Building → Gaien Building → Yotsuya Building → Shinjuku Building). Hereinafter, the building names in the derived database 2 are denoted by $c_i$ ($i \in \{1, 2, \dots, n\}$), and the building names in the basic database 1 are denoted by $d_j$ ($j \in \{1, 2, \dots, m\}$), where $n$ and $m$ are the numbers of building names in the respective databases.

The information stored in the basic database storage unit 1031 and the derived database storage unit 1032 can be, for example, information of the basic database 1 and the derived database 2 that is input from the input unit 107 and received by the processor 101 via the input/output interface 105. That is, the basic database 1 and the derived database 2 can be constructed in the data memory 103. Alternatively, all or part of the information held by a basic database 1 and a derived database 2 constructed in an external data server may be stored in the basic database storage unit 1031 and the derived database storage unit 1032. In this case, for example, the processor 101 acquires the information accumulated in the data server via the communication interface 104 in response to an instruction given by a user operation on the input unit 107, and stores it in the storage units 1031 and 1032. Alternatively, the processor 101 may acquire information recorded on a recording medium via the input/output interface 105. The processor 101 may also receive, from an external data processing device or the like via the communication interface 104, the information of the basic database 1 and the derived database 2 together with a request to associate their name data, and store the received database information in the storage units 1031 and 1032 as the information to be processed.

The temporary storage unit 1033 stores the undirected graphs created when the processor 101 operates as the graph creation unit 3, the common data extracted when it operates as the common data extraction unit 4, the path information about all paths extracted when it operates as the path information extraction unit 5, the partial paths extracted and the paths enumerated when it operates as the path creation unit 6, and the name data association results obtained when it operates as the association unit 7.

The output information storage unit 1034 stores output information obtained when the processor 101 operates as the data output unit 8 described above.

(Operation)
Next, the operation of the name data association device will be described.

FIG. 5 is a flow chart showing an example of a processing operation related to association of name data in the name data association device. Here, it is assumed that the information of the basic database 1 is already stored in the basic database storage unit 1031 and the information of the derivative database 2 is already stored in the derivative database storage unit 1032. When an instruction to perform name data association is given from the input unit 107 via the input/output interface 105, or from an external data processing device via the communication interface 104, the processor 101 of the name data association device starts the operation shown in this flow chart.

First, the processor 101 operates as the graph creation unit 3. That is, using the adjacency information, the processor 101 generates cycle graphs $G_c$ and $G_d$, whose vertices are the name data, from the information of the derived database 2 stored in the derived database storage unit 1032 and the information of the basic database 1 stored in the basic database storage unit 1031, respectively (step S1). The generated cycle graphs $G_c$ and $G_d$ are stored in the temporary storage unit 1033 of the data memory 103.

If the building names $c_i$ in the derived database 2 and the building names $d_j$ in the basic database 1 are each taken as vertices, and adjacent vertices are regarded as connected by edges, the cycle graphs $G_c$ and $G_d$, which are the following undirected graphs, can be constructed. Here, a cycle is a subgraph of the cycle graph $G_c$ and denotes a path whose start point and end point are the same vertex.

$V_d$: the set of vertices, that is, the building names $d_j$ of the basic database 1
$E_d$: the set of edges obtained from the adjacency information of the basic database 1
$g_d : E_d \to P(V_d)$: a mapping that associates each element of $E_d$ with a subset of the vertex set $V_d$, where $P(V_d)$ is the power set of the vertex set $V_d$
$G_d := (g_d, V_d, E_d)$

$V_c$: the set of vertices, that is, the building names $c_i$ of the derivative database 2
$E_c$: the set of edges obtained from the adjacency information of the derivative database 2
$g_c : E_c \to P(V_c)$: a mapping that associates each element of $E_c$ with a subset of the vertex set $V_c$, where $P(V_c)$ is the power set of the vertex set $V_c$
$G_c := (g_c, V_c, E_c)$

Also, the processor 101 of the name data association device operates as the common data extraction unit 4 . That is, the processor 101 extracts common name data between the information of the basic database 1 stored in the basic database storage unit 1031 and the information of the derived database 2 stored in the derived database storage unit 1032 (step S2). The extracted common name data is stored in temporary storage section 1033 of data memory 103 .

Next, the processor 101 operates as the path information extraction unit 5. That is, the processor 101 extracts the paths $\Gamma_k$ ($k \in \{1, 2, \dots, K\}$, where $K$ is the total number of paths in the cycle graph $G_c$) (step S3). Path information indicating the extracted paths $\Gamma_k$ is stored in the temporary storage unit 1033 of the data memory 103. The path information can include the number of vertices of each extracted path $\Gamma_k$, the name data of the included vertices, and their positions on the path.

Here, the path $\Gamma_k$ is the $k$-th path starting from the vertex $s_k \in V_c$ in the cycle graph $G_c$ of the derived database 2.
$\Gamma_k[l]$: the $l$-th vertex (the $l$-th element) among the vertices that compose $\Gamma_k$
$|\Gamma_k|$: the length of the path $\Gamma_k$ (the number of vertices forming the path $\Gamma_k$)
$\Gamma_k = (s_k, \dots, t_k)$,
$(\Gamma_k[l], \Gamma_k[l+1]) \in E_c$,
$l \in \{1, 2, \dots, |\Gamma_k|\}$

Any number of paths may exist in the cycle graph $G_c$, but every path is assumed to satisfy the following three conditions.
1. For every $s_k$ and $t_k$, there exist $d_j, d_l \in V_d$ such that $s_k = d_j$ and $t_k = d_l$.
2. All edges that make up the path $\Gamma_k$ are in $E_c$.
3. Every $c_i \in V_c$ belongs to some path $\Gamma_k$.
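The path information of step S3 can be illustrated with the following Python sketch. It assumes, as a simplification not stated in the embodiment, that each row of the derived database is a (path ID, upper building, lower building) triple and that the rows of one path are listed in traversal order. The function names `paths_from_rows` and `path_info` and the row layout are hypothetical; the sample rows simply spell out the example path Shinjuku Building → Minami-Shinjuku Building → Gaien Building → Yotsuya Building → Shinjuku Building.

```python
from collections import defaultdict


def paths_from_rows(rows):
    """Group (path_id, upper, lower) rows by path ID and chain them into vertex sequences.

    Assumes the rows of each path appear in traversal order, so that the lower
    building of one row is the upper building of the next row.
    """
    grouped = defaultdict(list)
    for path_id, upper, lower in rows:
        grouped[path_id].append((upper, lower))
    paths = {}
    for path_id, edges in grouped.items():
        vertices = [edges[0][0]]
        for upper, lower in edges:
            if upper != vertices[-1]:
                raise ValueError(f"rows of path {path_id} are not in traversal order")
            vertices.append(lower)
        paths[path_id] = vertices
    return paths


def path_info(vertices):
    """Return the number of vertices and the (1-based position, name) pairs of a path."""
    return {
        "num_vertices": len(vertices),
        "positions": list(enumerate(vertices, start=1)),
    }


# Hypothetical row layout for the example path with path ID 1.
rows = [
    (1, "Shinjuku Building", "Minami-Shinjuku Building"),
    (1, "Minami-Shinjuku Building", "Gaien Building"),
    (1, "Gaien Building", "Yotsuya Building"),
    (1, "Yotsuya Building", "Shinjuku Building"),
]
paths = paths_from_rows(rows)
info = path_info(paths[1])
```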

Here, let $S := \{c_i \in V_c \mid \exists d_j \in V_d \ \text{s.t.}\ c_i = d_j\}$ be the set of building names having the same notation in $V_c$ and $V_d$, extracted in step S2. The name data association device associates the $c_i$ and $d_j$ that are not elements of the set $S$ using the cycle graphs $G_c$ and $G_d$ as follows.

Here, one of the paths forming the cycle graph $G_c$ is denoted by $\Gamma_k$, and the array of the vertices forming the path $\Gamma_k$ that are included in the set $S$ is denoted by $I_k$, defined as follows.
$I_k := (\Gamma_k[i] \mid \Gamma_k[i] \in S,\ \Gamma_k[i] \neq s_k,\ i = 1, 2, \dots, |\Gamma_k|)$

Next, the processor 101 operates as the path creation unit 6.
That is, based on the path information extracted in step S3, the processor 101 first extracts, for one path $\Gamma_k$, the partial paths $L_k^i$ whose endpoints are elements of the set $S$ from the cycle graph $G_c$ (step S4). Here, $L_k^i$ is the partial path from vertex $I_k[i]$ to vertex $I_k[i+1]$ in the path $\Gamma_k$, and $I_k[i]$ is the $i$-th element of the array $I_k$.
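Step S4 can be sketched in Python as splitting a path at its common-data vertices. The sketch below is an assumption-laden simplification: it treats every vertex of the path that belongs to the set S, including the start point, as a possible endpoint, and it keeps only the segments that contain at least one non-common vertex between the endpoints. The function name `extract_partial_paths` and the sample names are hypothetical.

```python
def extract_partial_paths(path, common):
    """Split a path into partial paths whose end points are common names.

    Only segments with at least one non-common vertex between the end points are kept.
    """
    partial_paths = []
    start = None  # index of the most recent common vertex seen so far
    for idx, name in enumerate(path):
        if name not in common:
            continue
        if start is not None and idx - start > 1:
            partial_paths.append(path[start:idx + 1])
        start = idx
    return partial_paths


# Hypothetical path: X1, X2 and X3 stand for names with notation variations.
path = ["Shinjuku", "X1", "Gaien", "Yotsuya", "X2", "X3", "Shinjuku"]
common = {"Shinjuku", "Gaien", "Yotsuya"}
partials = extract_partial_paths(path, common)
```

In this sample the segment from "Gaien" to "Yotsuya" is skipped because both of its vertices are already common data and nothing remains to be associated there.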

Next, based on the extracted partial paths, the processor 101 enumerates, among the paths in the cycle graph $G_d$ of the basic database 1 whose start point is $I_k[i]$ and whose end point is $I_k[i+1]$, all paths whose length is at least $|L_k^i|$ ($i = 1, 2, \dots, |\Gamma_k|$) and at most $|L_k^i| + x$ (step S5). Here, $x$ is an integer greater than or equal to 0 specified by the user. Note that the same vertex and the same edge are not passed twice when enumerating these paths. The set of enumerated paths is written as $A_k^i$.
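The enumeration of step S5 can be sketched as a depth-first search over simple paths. This is a minimal sketch under stated assumptions: the graph is an adjacency dictionary, path length is counted in vertices, the two endpoints are distinct, and the caller derives the bounds `min_len` and `max_len` from the partial path length and the user-specified x. The function name `enumerate_paths`, its parameters, and the sample graph are hypothetical.

```python
def enumerate_paths(graph, start, goal, min_len, max_len):
    """Enumerate simple paths from start to goal with min_len <= number of vertices <= max_len.

    Assumes start != goal. No vertex is visited twice, so no edge is reused either.
    """
    results = []

    def dfs(path, visited):
        current = path[-1]
        if current == goal:
            if len(path) >= min_len:
                results.append(list(path))
            return  # a simple path cannot continue through its end point
        if len(path) >= max_len:
            return  # any extension would exceed the length limit
        for nxt in sorted(graph.get(current, ())):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                dfs(path, visited)
                path.pop()
                visited.remove(nxt)

    dfs([start], {start})
    return results


# Hypothetical graph of the basic database with two routes from A to D.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "E"},
    "D": {"B", "E"},
    "E": {"C", "D"},
}
same_length = enumerate_paths(graph, "A", "D", 3, 3)
up_to_one_longer = enumerate_paths(graph, "A", "D", 3, 4)
```

Bounding the search depth by `max_len` keeps the enumeration from exploring routes that could never be associated with the partial path.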

Next, the processor 101 operates as the association unit 7.
That is, if the set $A_k^i$ of paths enumerated in step S5 contains a path of length $|L_k^i|$, the processor 101 first lets that path be $\alpha$ and selects combinations of names as follows (step S6).
$(L_k^i[j], \alpha[j]),\quad j = 1, 2, \dots, |L_k^i|$
where $L_k^i[j]$ and $\alpha[j]$ are the $j$-th vertices of the respective paths.
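When a path of the same length exists, the selection of step S6 is a position-by-position pairing of the two vertex sequences. A minimal sketch (the function name `pair_by_position` and the sample names are hypothetical):

```python
def pair_by_position(partial_path, alpha):
    """Pair the j-th vertex of the partial path with the j-th vertex of alpha."""
    if len(partial_path) != len(alpha):
        raise ValueError("both paths must have the same number of vertices")
    return list(zip(partial_path, alpha))


# Hypothetical partial path of the derived database and same-length path of the basic database.
partial_path = ["Shinjuku Building", "Minami-Shinjuku", "Gaien Building"]
alpha = ["Shinjuku Building", "Minami-Shinjuku Building", "Gaien Building"]
pairs = pair_by_position(partial_path, alpha)
```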

Furthermore, if the paths enumerated in step S5 include a path whose length is greater than $|L_k^i|$, the processor 101 searches for and associates combinations of name data using a name search technique based on character string similarity (for example, edit distance), and stores the result in the temporary storage unit 1033 of the data memory 103 (step S7). Edit distance is disclosed, for example, in D. Gusfield, "Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology," Cambridge University Press, 1997.
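For reference, the edit distance used as an example of character string similarity can be computed with the standard dynamic-programming recurrence. The sketch below uses the Levenshtein distance with unit costs, which is one common definition and is assumed here rather than prescribed by the embodiment; the function names `edit_distance` and `closest_name` and the sample names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b (unit cost for insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def closest_name(name, candidates):
    """Return the candidate with the smallest edit distance to name."""
    return min(candidates, key=lambda candidate: edit_distance(name, candidate))


# Hypothetical names used only for illustration.
best = closest_name("Gaien Bldg", ["Gaien Building", "Yotsuya Building"])
```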

FIG. 6 is a schematic diagram for explaining a method of associating names. Suppose that there are the name data of buildings $BL_d$ stored in the basic database 1 (A building, B building, ..., n building) and the names of buildings $BL_c$ stored in the derivative database 2 (α building, β building, ..., ν building), and that, as indicated by the dashed lines in the figure, A building and α building, and n building and ν building, have already been associated, either because they have the same name or through a different path. In such a case, the processor 101 can search for combinations of name data by the following procedure.

1. Initialize x=0.
2. Enumerate paths of length $|L_k^i| + x$.
3. Exclude the names that have already been associated from the vertices of each enumerated path.
4. If the length of the obtained path is greater than $|L_k^i|$, then for each building among the buildings $BL_c$ of the derived database 2 that has not yet been associated (γ building), search the buildings $BL_d$ of the basic database 1 for the building with the shortest edit distance. For example, C building can be found as the building with the shortest edit distance from γ building and associated with it, as indicated by the solid arrow.
5. Set $x = x + 1$ and repeat steps 2 to 4 until the upper limit of $x$ specified in advance by the user is reached.
When searching for the building $BL_d$ with the shortest edit distance, the search starts from the building next to an already associated building $BL_d$, as indicated by the dashed arrow.
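The five-step procedure above can be sketched in Python as follows. This is only an illustrative sketch under several assumptions: the paths of the basic database between the same endpoints have already been enumerated and are passed in as `base_paths`, only the two common endpoints are treated as already associated, path length is counted in vertices, and the search order indicated by the dashed arrow is not modelled. The function names `edit_distance` and `associate`, the parameter `x_max`, and the sample names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def associate(partial_path, base_paths, x_max):
    """Collect candidate (derived name, base name) pairs for one partial path."""
    n = len(partial_path)
    # The common end points are written identically, so they are associated from the start.
    associated = {partial_path[0]: partial_path[0], partial_path[-1]: partial_path[-1]}
    candidates = set()
    for x in range(x_max + 1):                      # steps 1 and 5: x = 0, 1, ..., x_max
        for path in base_paths:
            if len(path) != n + x:                  # step 2: paths of length |L| + x
                continue
            # step 3: drop the names that are already associated
            derived_left = [v for v in partial_path if v not in associated]
            base_left = [v for v in path if v not in associated.values()]
            if x == 0:
                # same length: pair the remaining vertices by position
                candidates.update(zip(derived_left, base_left))
            else:
                # step 4: longer path, so pick the nearest name by edit distance
                for name in derived_left:
                    best = min(base_left, key=lambda c: edit_distance(name, c))
                    candidates.add((name, best))
    return candidates


# Hypothetical partial path (derived database) and one longer path (basic database).
partial_path = ["Shinjuku", "Gaien Bldg", "Yotsuya"]
base_paths = [["Shinjuku", "Harajuku Building", "Gaien Building", "Yotsuya"]]
result = associate(partial_path, base_paths, x_max=1)
```

The function returns candidate pairs only; narrowing them down when a name has several candidates is a separate step.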

As described above, name data for which there is only one combination is output as it is; for other names, name data for which an output result has already been obtained is excluded from the candidates. Among the candidates that remain after this exclusion, only those that are consistent are kept as the association result. Consistency here means the following. Suppose there are multiple candidate names for a certain name A, and one of them, name B, is excluded by the above operation. There always exists a path P that served as the basis for outputting the combination (A, B) of the name A and the excluded name B. This path P also gives a name combination (C, D) for a name C different from the name A. Because the combination (A, B) has been excluded, the combination (C, D) is excluded as well. A more specific example will be described later as an operation example.

When the processing of one path $\Gamma_k$ is completed in this way, the processor 101 determines, based on the path information extracted in step S3, whether or not all of the paths $\Gamma_k$ have been processed (step S8). That is, it determines whether the processing has been completed for all vertices of all paths $\Gamma_k$. If it determines that there is a path $\Gamma_k$ that has not yet been processed, the processor 101 updates $k$, returns to step S4, and repeats the processes of steps S4 to S7.

If it is determined in step S8 that all paths Γ _k have been processed, the processor 101 operates as the data output unit 8 to output name data association information (step S9). That is, the processor 101 generates output information, in a form instructed from the input unit 107 or from an external data processing device, from the association results stored in the temporary storage unit 1033 of the data memory 103, and stores the generated output information in the output information storage unit 1034 of the data memory 103. The processor 101 can display this output information on the display unit 108 via the input/output interface 105, or can transmit it to an external data processing device via the communication interface 104.

In the name data association device according to the embodiment described above, the path creation unit 6 extracts partial paths having common data as endpoints and non-common data as vertices between the endpoints, and creates, for each partial path, a path having the same common data endpoints and a length equal to or greater than the length of the partial path. The association unit 7 then searches for combinations of vertices on the partial path and vertices on the created path, thereby associating the name data held by the basic database 1 with the name data held by the derived database 2. As a result, synonymous name data with notation variations between the databases to be integrated can be matched accurately without human intervention, even if the character strings of the name data do not correspond between the databases. Therefore, information on a given matter can be collected from different databases without omission. In addition, work efficiency can be expected to improve because human operations are reduced.

In the name data association device according to one embodiment, the graph creation unit 3 creates cycle graphs G _d and G _c , which are undirected graphs of the basic database 1 and the derivative database 2, with the name data as vertices. The path information extraction unit 5 generates all paths Γ _k whose endpoints are the common data and whose vertices are the name data held in the derived database 2, and for each of these paths Γ _k extracts path information including the number of vertices, the name data of the included vertices, and their positions on the path. Then, for one of the paths Γ _k , the path creation unit 6 extracts partial paths from the cycle graph G _c based on the path information, and for each of these partial paths creates, from the cycle graph G _d , a path that has the same common data endpoints as the partial path and that includes a number of vertices equal to or greater than the number of vertices of the partial path. Therefore, a path can be created that includes the vertices, among the name data held by the basic database 1, that may be associated with the name data held by the derived database 2, and that excludes vertices that have no possibility of being associated with that data.

Further, here, the path creation unit 6 creates a path including the number of vertices equal to or greater than the number of vertices of the path Γ _k and equal to or less than the number of vertices specified by the user. Therefore, by limiting the number of vertices included in the path, the processing time can be shortened.

Further, in the name data association device according to the embodiment, for each vertex on the path created by the path creation unit 6, when the position on the path corresponds to a vertex on the partial path, the association unit 7 associates the name data corresponding to the vertex on the path, among the name data held by the basic database 1, with the name data of the vertex on the partial path, among the name data held by the derivative database 2. When the position on the path does not correspond to a vertex on the partial path, the association unit 7 associates, based on the character string similarity between name data, the name data corresponding to the vertex on the path, among the name data held by the basic database 1, with the name data of a vertex on the partial path, among the name data held by the derived database 2. Therefore, the name data held by the basic database 1 can be easily associated with the name data held by the derivative database 2.

Further, the name data association device according to one embodiment repeats the processing of the path creation unit 6 and the association unit 7 until the processing for all paths Γ _k generated by the path information extraction unit 5 is completed. Therefore, it is possible to reduce the probability that the name data held by the derivative database 2 fails to be associated with the name data held by the basic database 1 .

In addition, the name data association device according to one embodiment uses the data output unit 8 to generate output information including a name data correspondence table based on the result of name data association. Therefore, by using this output information, it is possible to perform database integration processing. Further, the name data association device according to one embodiment may generate information of the integrated database as the output information.

[Example of operation]
As an operation example of the present embodiment, an overview of applied name data and results will be described.

FIG. 7 is a diagram showing an example of information held by the basic database 1 stored in the basic database storage unit 1031 in the operation example. The adjacency information obtained from this basic database is as follows. Here, the notation (A, B) indicates that name data A and name data B are connected.
・(Fukuoka Hanazono Building, Ritsukoyama Building)
・(Ritsukoyama Building, Fukuyama Date Building)
・(Fukuyama Date Building, Kuwabara Building)
・(Kuwabara Building, Fukui Fujita Building)
・(Fukui Fujita Building, Fukuchi Yanagawa Building)
・(Fukuchi Yanagawa Building, Hoshina Building)
・(Hoshina Building, Osorezan Building)
・(Hoshina Building, Fukuoka Hanazono Building)
・(Osorezan Building, Fukuoka Hanazono Building)
・(Osorezan Building, Tsukidate Building)
・(Tsukidate Building, Fukushima Kawamata Building)
・(Fukushima Kawamata Building, Fukuoka Hanazono Building)
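As an illustration of how such adjacency pairs could be turned into the undirected graph used in the later steps, the following sketch builds an adjacency map from the twelve pairs listed above. The names `BASIC_ADJACENCY`, `build_undirected_graph`, and `G_d` are hypothetical and are not taken from the embodiment.

```python
from collections import defaultdict

# Adjacency pairs (A, B) of the basic database, as listed above.
BASIC_ADJACENCY = [
    ("Fukuoka Hanazono Building", "Ritsukoyama Building"),
    ("Ritsukoyama Building", "Fukuyama Date Building"),
    ("Fukuyama Date Building", "Kuwabara Building"),
    ("Kuwabara Building", "Fukui Fujita Building"),
    ("Fukui Fujita Building", "Fukuchi Yanagawa Building"),
    ("Fukuchi Yanagawa Building", "Hoshina Building"),
    ("Hoshina Building", "Osorezan Building"),
    ("Hoshina Building", "Fukuoka Hanazono Building"),
    ("Osorezan Building", "Fukuoka Hanazono Building"),
    ("Osorezan Building", "Tsukidate Building"),
    ("Tsukidate Building", "Fukushima Kawamata Building"),
    ("Fukushima Kawamata Building", "Fukuoka Hanazono Building"),
]


def build_undirected_graph(pairs):
    """Map each name to the set of names adjacent to it (both directions)."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


G_d = build_undirected_graph(BASIC_ADJACENCY)
```

A set-valued adjacency map is chosen here only because it makes neighbor lookups and duplicate pairs trivial to handle; any equivalent graph representation would serve.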

FIG. 8 is a diagram showing an example of information held by the derivative database 2 stored in the derivative database storage unit 1032 in the operation example. The adjacency information obtained from this derived database is as follows. Since the path given by this adjacency information is Γ _k and only one path is handled in this operation example, let k=1. Here, the notation (A → B) indicates that there is a path from name data A to name data B.
・(Hanazono Building → Date Building)
・(Date Building → Kuwabara Building)
・(Kuwabara Building → Fujita Building)
・(Fujita Building → Yanagawa Building)
・(Yanagawa Building → Hoshina Building)
・(Hoshina Building → Osorezan Building)
・(Osorezan Building → Tsukikan Building)
・(Tsukikan Building → Kawamata Building)
・(Kawamata Building → Hanazono Building)

In this operation example, the vertex sets V _c and V _d for the name data of path ID=2 are as follows.
V _c = {Fukuoka Hanazono Building, Ritsukoyama Building, Fukuyama Date Building, Kuwabara Building, Fukui Fujita Building, Fukuchi Yanagawa Building, Hoshina Building, Osorezan Building, Tsukidate Building, Fukushima Kawamata Building}
V _d = {Hanazono Building, Date Building, Kuwabara Building, Fujita Building, Yanagawa Building, Hoshina Building, Osorezan Building, Tsukikan Building, Kawamata Building}

In this operation example, the correct association of the name data, that is, the set of pairs of notations that denote the same building, is as follows.
{(Tsukidate Building, Tsukikan Building), (Fukushima Kawamata Building, Kawamata Building), (Fukuoka Hanazono Building, Hanazono Building), (Fukuyama Date Building, Date Building), (Fukui Fujita Building, Fujita Building), (Fukuchi Yanagawa Building, Yanagawa Building)}

We checked whether the device for associating name data according to the embodiment can perform this association correctly.

In step S1, the processor 101 of the name data association device operates as the graph creation unit 3 to create cycle graphs. FIG. 9 is a schematic diagram showing an example of the cycle graph G _d created from the information held by the basic database 1 in the operation example.

Further, in step S2, the processor 101 operates as the common data extraction unit 4 to extract name data common to the cycle graph G _c and the cycle graph G _d . Here, the name data having the same notation, that is, the building name set S, is as follows.
S := {Kuwabara Building, Hoshina Building, Osorezan Building}
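The extraction of common data in step S2 amounts to taking the names that appear with identical notation on both sides. A minimal sketch, assuming the two vertex sets are available as plain Python sets (the variable names `basic_names`, `derived_names`, and `S` are illustrative):

```python
# Name data held by the basic database (vertices of the cycle graph built from it).
basic_names = {
    "Fukuoka Hanazono Building", "Ritsukoyama Building", "Fukuyama Date Building",
    "Kuwabara Building", "Fukui Fujita Building", "Fukuchi Yanagawa Building",
    "Hoshina Building", "Osorezan Building", "Tsukidate Building",
    "Fukushima Kawamata Building",
}

# Name data held by the derived database.
derived_names = {
    "Hanazono Building", "Date Building", "Kuwabara Building", "Fujita Building",
    "Yanagawa Building", "Hoshina Building", "Osorezan Building",
    "Tsukikan Building", "Kawamata Building",
}

# Common data: names whose notation is exactly the same in both databases.
S = basic_names & derived_names
```

Exact string equality is used deliberately: names that differ in notation are left for the path-based association that follows.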

Therefore, in step S3, the processor 101 operates as the path information extraction unit 5 to extract path information from the derivative database 2, and in step S4, it operates as the path creation unit 6 to extract partial paths. FIG. 10 is a schematic diagram showing an example of the path Γ ₁ generated from the cycle graph G _c created from the information held by the derivative database 2 in the operation example. The processor 101 extracts, from the cycle graph G _c , the partial paths whose endpoints are elements of the building name set S, obtaining the following.
L ₁ ¹ := (Kuwabara Building, Fujita Building, Yanagawa Building, Hoshina Building)
L ₁ ² := (Hoshina Building, Osorezan Building)
L ₁ ³ := (Osorezan Building, Tsukikan Building, Kawamata Building, Hanazono Building, Date Building, Kuwabara Building)
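One way to obtain these three partial paths is to walk the closed path Γ ₁ once, starting at a common vertex, and cut it every time another common vertex is reached. The sketch below assumes the closed path is given as an ordered vertex list without repeating the first vertex at the end; `split_closed_path` is a hypothetical helper name.

```python
def split_closed_path(cycle, common):
    """Cut a closed path into partial paths whose endpoints are common data."""
    starts = [i for i, v in enumerate(cycle) if v in common]
    if len(starts) < 2:
        return []  # fewer than two common vertices: not handled in this sketch
    first = starts[0]
    # Rotate so the walk begins at a common vertex, then close the loop.
    rotated = cycle[first:] + cycle[:first] + [cycle[first]]
    parts, current = [], [rotated[0]]
    for vertex in rotated[1:]:
        current.append(vertex)
        if vertex in common:
            parts.append(current)
            current = [vertex]  # the endpoint also starts the next partial path
    return parts


gamma_1 = [
    "Hanazono Building", "Date Building", "Kuwabara Building",
    "Fujita Building", "Yanagawa Building", "Hoshina Building",
    "Osorezan Building", "Tsukikan Building", "Kawamata Building",
]
S = {"Kuwabara Building", "Hoshina Building", "Osorezan Building"}
partial_paths = split_closed_path(gamma_1, S)
```

With the data of this operation example, the three lists in `partial_paths` coincide with L ₁ ¹, L ₁ ², and L ₁ ³ above.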

Then, in step S5, for the partial path L ₁ ¹, the processor 101 enumerates the paths on the cycle graph G _d that have "Kuwabara Building" and "Hoshina Building" as endpoints and a length of 3 or more and 3+x or less.

Since this is an operation example, the parameter x is set to 1. The following paths are then obtained:
Length 3: (Kuwabara Building, Fukui Fujita Building, Fukuchi Yanagawa Building, Hoshina Building)
Length 4: (Kuwabara Building, Fukuyama Date Building, Ritsukoyama Building, Fukuoka Hanazono Building, Hoshina Building)
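The enumeration in step S5 can be sketched as a depth-first search for simple paths whose number of edges lies between the length of the partial path and that length plus x. The sketch assumes that common data may appear only at the endpoints of a created path (mirroring the definition of a partial path); this restriction, like the names `build_graph` and `enumerate_paths`, is an assumption made for illustration.

```python
from collections import defaultdict

EDGES = [
    ("Fukuoka Hanazono Building", "Ritsukoyama Building"),
    ("Ritsukoyama Building", "Fukuyama Date Building"),
    ("Fukuyama Date Building", "Kuwabara Building"),
    ("Kuwabara Building", "Fukui Fujita Building"),
    ("Fukui Fujita Building", "Fukuchi Yanagawa Building"),
    ("Fukuchi Yanagawa Building", "Hoshina Building"),
    ("Hoshina Building", "Osorezan Building"),
    ("Hoshina Building", "Fukuoka Hanazono Building"),
    ("Osorezan Building", "Fukuoka Hanazono Building"),
    ("Osorezan Building", "Tsukidate Building"),
    ("Tsukidate Building", "Fukushima Kawamata Building"),
    ("Fukushima Kawamata Building", "Fukuoka Hanazono Building"),
]
COMMON = {"Kuwabara Building", "Hoshina Building", "Osorezan Building"}


def build_graph(edges):
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def enumerate_paths(graph, start, goal, min_len, max_len, common):
    """Simple paths from start to goal with min_len..max_len edges."""
    results = []

    def dfs(path):
        node, edges_used = path[-1], len(path) - 1
        if node == goal:
            if edges_used >= min_len:
                results.append(tuple(path))
            return
        if edges_used == max_len:
            return
        for nxt in sorted(graph[node]):
            if nxt in path:
                continue  # keep the path simple
            if nxt in common and nxt != goal:
                continue  # assumption: no common data between the endpoints
            dfs(path + [nxt])

    dfs([start])
    return sorted(results, key=len)


G_d = build_graph(EDGES)
x = 1
paths_l11 = enumerate_paths(G_d, "Kuwabara Building", "Hoshina Building", 3, 3 + x, COMMON)
paths_l13 = enumerate_paths(G_d, "Osorezan Building", "Kuwabara Building", 5, 5 + x, COMMON)
```

Under the stated assumption, `paths_l11` holds the length-3 and length-4 paths listed above, and `paths_l13` holds a single path of length 6.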

Combining the corresponding vertex names of the enumerated length-3 path and the partial path L ₁ ¹ yields the following candidates:
(Fukui Fujita Building, Fujita Building), (Fukuchi Yanagawa Building, Yanagawa Building)

For the length-4 path, every combination has an edit distance of 1, so the following candidates can be considered:
Candidates for "Fujita Building": "Fukuyama Date Building", "Ritsukoyama Building", "Fukui Fujita Building"
Candidates for "Yanagawa Building": "Fukuoka Hanazono Building", "Ritsukoyama Building", "Fukuchi Yanagawa Building"
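One reading that is consistent with the candidate lists above is an order-preserving alignment: the i-th interior vertex of the partial path may correspond only to the interior vertices i through i+(n-m) of a created path, where m and n are the numbers of interior vertices of the partial path and the created path. The sketch below illustrates that reading only; it is an interpretation rather than the procedure stated in the embodiment, and `positional_candidates` is a hypothetical name.

```python
def positional_candidates(partial_path, created_path):
    """Candidates per interior vertex under an order-preserving alignment."""
    inner_partial = partial_path[1:-1]
    inner_created = created_path[1:-1]
    slack = len(inner_created) - len(inner_partial)
    if slack < 0:
        raise ValueError("created path must not be shorter than the partial path")
    # Vertex i of the partial path can slide over positions i .. i + slack.
    return {v: inner_created[i:i + slack + 1] for i, v in enumerate(inner_partial)}


L11 = ["Kuwabara Building", "Fujita Building", "Yanagawa Building", "Hoshina Building"]
path_len3 = ["Kuwabara Building", "Fukui Fujita Building",
             "Fukuchi Yanagawa Building", "Hoshina Building"]
path_len4 = ["Kuwabara Building", "Fukuyama Date Building", "Ritsukoyama Building",
             "Fukuoka Hanazono Building", "Hoshina Building"]

by_position = positional_candidates(L11, path_len3)  # one candidate per vertex
windowed = positional_candidates(L11, path_len4)     # two candidates per vertex
```

When the lengths are equal the window has width one and the association is purely positional; when the created path is longer, the candidates inside each window would then be ranked by edit distance, which ties in this example.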

Partial path L ₁ ² is omitted because it has length 1.

For the partial path L ₁ ³ , the processor 101 enumerates the paths on the cycle graph G _d that have "Osorezan Building" and "Kuwabara Building" as endpoints and a length of 5 or more and 5+x=6 or less. The following is then obtained:
Length 5: N/A
Length 6: (Osorezan Building, Tsukidate Building, Fukushima Kawamata Building, Fukuoka Hanazono Building, Ritsukoyama Building, Fukuyama Date Building, Kuwabara Building)

From the path of length 6, selecting the vertex with the shortest edit distance from each vertex of the partial path L ₁ ³ yields the following candidates:
(Tsukidate Building, Tsukikan Building), (Fukushima Kawamata Building, Kawamata Building), (Fukuoka Hanazono Building, Hanazono Building), (Fukuyama Date Building, Date Building)

From the above, a combination with only one candidate is adopted as the answer, so the following are the answers:
(Tsukidate Building, Tsukikan Building), (Fukushima Kawamata Building, Kawamata Building), (Fukuoka Hanazono Building, Hanazono Building), (Fukuyama Date Building, Date Building)

Then, because "Fukuoka Hanazono Building" and "Fukuyama Date Building" have been fixed as the answers for "Hanazono Building" and "Date Building", the candidates for "Fujita Building" and "Yanagawa Building" become the following:
Candidates for "Fujita Building": "Ritsukoyama Building" and "Fukui Fujita Building"
Candidates for "Yanagawa Building": "Ritsukoyama Building" and "Fukuchi Yanagawa Building"
Here, since "Fukuoka Hanazono Building" and "Fukuyama Date Building" are no longer candidates, the following path cannot be the path in the cycle graph G _d that corresponds to the partial path L ₁ ¹ :
Path: (Kuwabara Building, Fukuyama Date Building, Ritsukoyama Building, Fukuoka Hanazono Building, Hoshina Building)
Therefore, "Ritsukoyama Building" is also excluded from the candidates for "Fujita Building" and "Yanagawa Building", and the following are obtained as the answer:
(Fukui Fujita Building, Fujita Building), (Fukuchi Yanagawa Building, Yanagawa Building)
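The exclusion just described can be sketched as a small fixed-point loop: adopt every name with a single remaining candidate, then discard every created path that proposed an already adopted basic-database name for a different derived-database name, together with all candidates that came from that path. The data layout (candidates tagged with a path label such as "P4") and the function name `resolve` are assumptions made for this illustration.

```python
def resolve(candidates):
    """candidates: derived name -> iterable of (basic name, source path label)."""
    cands = {d: set(pairs) for d, pairs in candidates.items()}
    answers = {}
    changed = True
    while changed:
        changed = False
        # 1. A derived name with exactly one distinct candidate is answered.
        for d, pairs in cands.items():
            names = {b for b, _ in pairs}
            if len(names) == 1 and d not in answers:
                answers[d] = next(iter(names))
                changed = True
        # 2. A path that pairs an answered basic name with another derived
        #    name is inconsistent; drop every candidate that came from it.
        used = {b: d for d, b in answers.items()}
        invalid = {p for d, pairs in cands.items()
                   for b, p in pairs if b in used and used[b] != d}
        if invalid:
            cands = {d: {(b, p) for b, p in pairs if p not in invalid}
                     for d, pairs in cands.items()}
            changed = True
    return answers


candidates = {
    "Fujita Building": [("Fukui Fujita Building", "P3"),
                        ("Fukuyama Date Building", "P4"),
                        ("Ritsukoyama Building", "P4")],
    "Yanagawa Building": [("Fukuchi Yanagawa Building", "P3"),
                          ("Ritsukoyama Building", "P4"),
                          ("Fukuoka Hanazono Building", "P4")],
    "Tsukikan Building": [("Tsukidate Building", "P6")],
    "Kawamata Building": [("Fukushima Kawamata Building", "P6")],
    "Hanazono Building": [("Fukuoka Hanazono Building", "P6")],
    "Date Building": [("Fukuyama Date Building", "P6")],
}
result = resolve(candidates)
```

With these inputs the length-4 path "P4" is discarded after the first round, which leaves a single candidate each for "Fujita Building" and "Yanagawa Building".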

After that, the processor 101 generates output information based on the result of the association stored in the temporary storage unit 1033 of the data memory 103 and stores it in the output information storage unit 1034 of the data memory 103. FIG. 11 is a diagram showing an example of the output information stored in this output information storage unit 1034. Although the output information is shown here as a correspondence table showing the correspondence of name data, it is of course not limited to this.

From the above, it was verified that the name data association device can associate name data accurately by using partial paths.

[Other embodiments]
In the above-described embodiment, the case where the number of target databases is two has been described as an example, but the number of target databases may be three or more. That is, if at least one of three or more databases holds path identification information, name data can be associated with the remaining two or more databases.

In addition, in the above-described embodiment, a path has been described as an example, but it is of course possible to deal with a closed path instead of a path (the starting point and the ending point are the same vertex).

Further, in the above-described embodiment, an example was explained in which all or part of the information held by the basic database 1 and the derived database 2 is stored in the basic database storage unit 1031 and the derived database storage unit 1032 of the data memory 103 to proceed with the processing. However, the embodiment is not limited to this. The processor 101 may access an external data server through the communication interface 104 as appropriate, proceed with the processing using the information accumulated in the basic database 1 and the derivative database 2 constructed there, and store only the processing results of each step in the temporary storage unit 1033. As a result, the capacity of the data memory 103 included in the name data association device can be kept small, and the name data association device can be configured at low cost.

Further, in the above-described embodiment, an example in which output information is generated and output to the display unit 108 or an external data processing device has been described. However, the association result may be output as it is. As a result, the capacity of the data memory 103 included in the name data association device can be kept small, and the name data association device can be configured at low cost. Further, it is possible to provide a service that only associates name data to a data processing device that performs database integration processing.

Further, the method described in each embodiment can be stored, as a program (software means) executable by a computer, on a recording medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disk (CD-ROM, DVD, MO, etc.), or a semiconductor memory (ROM, RAM, flash memory, etc.), or can be transmitted and distributed via a communication medium. The programs stored on the medium also include a setting program for configuring, in the computer, the software means (including not only execution programs but also tables and data structures) to be executed by the computer. A computer that realizes this device reads the program recorded on the recording medium, builds the software means by the setting program where necessary, and executes the above-described processes with its operation controlled by the software means. The term "recording medium" as used herein is not limited to media for distribution, and includes storage media such as magnetic disks and semiconductor memories provided in computers or in devices connected via a network.

In short, the present invention is not limited to the above embodiments, and can be modified in various ways without departing from the gist of the invention at the implementation stage. Moreover, each embodiment may be implemented in combination as much as possible, and in that case, the combined effect can be obtained. Furthermore, the above-described embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining a plurality of disclosed constituent elements.

REFERENCE SIGNS LIST
1 basic database
2 derivative database
3 graph creation unit
4 common data extraction unit
5 path information extraction unit
6 path creation unit
7 association unit
8 data output unit
101 processor
102 program memory
103 data memory
104 communication interface
105 input/output interface
106 bus
107 input unit
108 display unit
1031 basic database storage unit
1032 derived database storage unit
1033 temporary storage unit
1034 output information storage unit

Claims

A name data association device that associates synonymous name data having different notations between a first database, which holds a plurality of name data and adjacency information indicating logical or physical adjacency relationships among the name data, and a second database, which holds a plurality of name data, adjacency information of the name data, and path identification information representing a path to which the name data belongs, the device comprising:
a common data extraction unit that extracts, as common data, name data having the same notation between the first database and the second database;
a path creation unit that extracts, from the path represented by the path identification information held by the second database, partial paths having the common data extracted by the common data extraction unit as endpoints and non-common data as vertices between the endpoints, and that creates, for each of the partial paths and based on the information held by the first database, a path having the same common data endpoints as the partial path and a length equal to or greater than the length of the partial path; and
an association unit that, for each of the partial paths extracted by the path creation unit, searches for combinations of vertices on the partial path and vertices on the path created by the path creation unit, thereby associating the name data held by the first database with the name data held by the second database.
2. The name data association device according to claim 1, further comprising:
a graph creation unit that creates undirected graphs of the first database and the second database, with the name data as vertices, based on the information held by the first database and the second database; and
a path information extraction unit that generates, based on the undirected graph of the second database created by the graph creation unit and the path identification information held by the second database, all paths having the common data extracted by the common data extraction unit as endpoints and the name data held by the second database as vertices, and that extracts, for each of these paths, path information including the number of vertices, the name data of the included vertices, and their positions on the path,
wherein the path creation unit, for one of the paths generated by the path information extraction unit, extracts the partial paths from the undirected graph of the second database created by the graph creation unit based on the path information, and creates, for each of the partial paths, from the undirected graph of the first database, a path that has the same common data endpoints as the partial path and that includes a number of vertices equal to or greater than the number of vertices of the partial path.
3. The name data association device according to claim 2, wherein the path creation unit creates, as the path, a path including a number of vertices equal to or greater than the number of vertices and equal to or less than a number specified by a user with respect to the number of vertices.
4. The name data association device according to any one of claims 1 to 3, wherein the association unit, for each of the vertices on the path created by the path creation unit:
if the position on the path corresponds to a vertex on the partial path, associates the name data corresponding to the vertex on the path, among the name data held by the first database, with the name data of the vertex on the partial path, among the name data held by the second database; and
if the position on the path does not correspond to a vertex on the partial path, associates, based on the character string similarity between name data, the name data corresponding to the vertex on the path, among the name data held by the first database, with the name data of a vertex on the partial path, among the name data held by the second database.
The name data associating device according to claim 2 or 3, wherein the path creating unit and the associating unit repeat the processing until all the paths generated by the path information extracting unit are processed.
The name data associating device according to any one of claims 1 to 5, further comprising an output unit that generates output information including a name data correspondence table based on the result of the associating by the associating unit.
A name data association method in a name data association device that comprises a processor and a memory storing a first database, which holds a plurality of name data and adjacency information indicating logical or physical adjacency relationships among the name data, and a second database, which holds a plurality of name data, adjacency information of the name data, and path identification information representing a path to which the name data belongs, and that associates synonymous name data having different notations between the first database and the second database, the method comprising:
extracting, by the processor, as common data, name data having the same notation between the first database and the second database stored in the memory;
extracting, by the processor, from the path represented by the path identification information held by the second database stored in the memory, partial paths having the extracted common data as endpoints and non-common data as vertices between the endpoints;
creating, by the processor, for each of the extracted partial paths and based on the information held by the first database stored in the memory, a path having the same common data endpoints as the partial path and a length equal to or greater than the length of the partial path; and
associating, by the processor, the name data held by the first database stored in the memory with the name data held by the second database stored in the memory by searching, for each of the extracted partial paths, for combinations of vertices on the partial path and vertices on the created path.
7. A name data association program that causes a processor to function as each part of the name data association device according to any one of claims 1 to 6.