WO2022259303A1 - Name data association device, name data association method, and name data association program - Google Patents

Publication number
WO2022259303A1
Authority
WO
WIPO (PCT)
Prior art keywords
path
name data
database
data
building
Prior art date
Application number
PCT/JP2021/021548
Other languages
French (fr)
Japanese (ja)
Inventor
まな美 小川
正崇 佐藤
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/021548
Priority to JP2023527147A
Publication of WO2022259303A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • Embodiments of the present invention relate to a name data association device, a name data association method, and a name data association program.
  • Non-Patent Document 1 and Non-Patent Document 2 propose methods of searching for the most similar character string by quantitatively calculating the degree of similarity between the character strings to be searched.
  • Non-Patent Document 3 proposes a method of accurately and efficiently searching for character strings representing the same matter by creating a search dictionary.
  • Patent Literature 1 discloses a linking method using peripheral information of data for which name identification is desired.
  • Non-Patent Documents 1 and 2 are popular and effective means when only the former abbreviated notation exists as a notation variation.
  • each common name is associated with a name that is similar in character string to the common name, so there is a high possibility of presenting an erroneous result. This is because, in many cases, common name notation is significantly different from the name that should be originally linked.
  • The methods of Non-Patent Documents 1 and 2 were created on the assumption that they will be used for Japanese, so the scope of application of the technology is limited. Since the characteristics of abbreviations in Japanese do not all match those of other languages, the methods disclosed in Non-Patent Documents 1 and 2 cannot necessarily be applied without problems to name data input in other languages.
  • Non-Patent Document 3 is the optimal method for common name notation.
  • the dictionary needs to be expanded accordingly, so there is a drawback that it takes time to cope with spelling variations.
  • Japanese Patent Application Laid-Open No. 2002-200010 proposes a technique that makes it possible to identify common names by using information around data to be identified (data A and data B in the same database are related, etc.) without relying on a dictionary.
  • However, the technology disclosed in Patent Document 1 requires that the graphs constructed from the name data of each database satisfy a kind of inclusion relationship (an edge corresponding to each edge of one graph always exists in the other graph). For this reason, it is difficult to match name data that yields a graph whose structure does not maintain the inclusion relationship, and even where matching is possible, a large number of candidate names are generated.
  • This invention seeks to provide a technology that can accurately associate synonymous name data that has notational variations between databases to be integrated without requiring human intervention.
  • A name data association device according to one embodiment includes a first database that holds a plurality of name data and adjacency information indicating a logical or physical adjacency relationship between the name data, and a second database that holds a plurality of name data, adjacency information of the name data, and path identification information representing paths to which the name data belong.
  • The device further comprises a common data extraction unit, a path creation unit, and an association unit. The common data extraction unit extracts name data having the same notation between the first database and the second database as common data.
  • The path creation unit extracts, from the path represented by the path identification information held by the second database, a partial path having common data extracted by the common data extraction unit as endpoints and non-common data as vertices between the endpoints. Based on the information held by the first database, it then creates, for each partial path, a path whose endpoints are the same common data as the endpoints of the partial path and whose length is equal to or greater than the length of the partial path.
  • The associating unit searches for combinations of vertices on the partial path and vertices on the path created by the path creation unit, thereby associating the name data held by the first database with the name data held by the second database.
  • FIG. 1 is a block diagram showing an example of the configuration of a name data association device according to one embodiment of the present invention.
  • FIG. 2 is a diagram showing an example of the hardware configuration of the name data association device.
  • FIG. 3 is a diagram showing an example of information held by a basic database stored in a basic database storage unit.
  • FIG. 4 is a diagram showing an example of information held by a derived database stored in a derived database storage unit.
  • FIG. 5 is a flow chart showing an example of processing operations related to association of name data in the name data association device.
  • FIG. 6 is a schematic diagram for explaining a method of associating names.
  • FIG. 7 is a diagram showing an example of information held by a basic database in an operation example.
  • FIG. 8 is a diagram showing an example of information held by the derivative database in the operation example.
  • FIG. 9 is a schematic diagram showing an example of a cycle graph created from information held in a basic database by the graph creating unit in the operation example.
  • FIG. 10 is a schematic diagram showing an example of a path generated from a cycle graph created from information held in a derived database by the graph creating unit in the operation example.
  • FIG. 11 is a diagram illustrating an example of output information stored in an output information storage unit in the operation example.
  • each data column can include name data and string-specific data corresponding to the name data, such as measurement value, date and time of measurement, date and time of sales, and amount of sales.
  • each database holds logical or physical adjacency information indicating the adjacency relationship of name data.
  • The adjacency information indicating the adjacency relationship of the name data refers to information on how data are connected to each other, for example personal connections (person A and person B are acquaintances) or connection relationships on a network (building A and building B are connected by a cable).
  • Each database has columns named "upper building" and "lower building", indicating that the name data stored in "upper building" and the name data stored in "lower building" are adjacent.
  • at least one of the plurality of databases is added with path identification information indicating the path to which the name data belongs, in addition to the adjacent information.
  • FIG. 1 is a block diagram showing an example of the configuration of a name data association device according to one embodiment of the present invention.
  • The name data association device includes a basic database (in the figure, database is abbreviated as DB) 1, a derivative database 2, a graph creation unit 3, a common data extraction unit 4, a path information extraction unit 5, a path creation unit 6, an association unit 7, and a data output unit 8.
  • the basic database 1 is a first database that holds a plurality of name data and adjacency information indicating the adjacency relationship between the name data.
  • the derivative database 2 is a second database that holds a plurality of name data, adjacency information of the name data, and path identification information representing paths to which the name data belong.
  • Based on the information held by the basic database 1 and the derived database 2, the graph creation unit 3 creates an undirected graph with name data as vertices.
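  • As a minimal sketch of how such an undirected graph could be held in memory, the following Python snippet builds an adjacency map from pairs of adjacent name data. The function name `build_undirected_graph`, the tuple layout of the rows, and the sample building names are assumptions made for illustration and do not appear in the embodiment.

```python
from collections import defaultdict


def build_undirected_graph(rows):
    """Return {name: set of adjacent names} from (upper, lower) pairs."""
    graph = defaultdict(set)
    for upper, lower in rows:
        # Adjacency is symmetric, so record the edge in both directions.
        graph[upper].add(lower)
        graph[lower].add(upper)
    return dict(graph)


# Hypothetical sample rows: each pair is ("upper building", "lower building").
rows = [
    ("Building A", "Building B"),
    ("Building B", "Building C"),
    ("Building C", "Building A"),
]
graph = build_undirected_graph(rows)
print(sorted(graph["Building A"]))
```

  • Storing neighbours in sets keeps duplicate rows harmless and makes the later "is this vertex adjacent" checks cheap.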
  • the common data extraction unit 4 extracts name data that is written in the same way between the basic database 1 and the derivative database 2 as common data.
  • The path information extraction unit 5 generates all paths that take one of the common data extracted by the common data extraction unit 4 as the starting point and have the name data held by the derivative database 2 as vertices.
  • the end point of the path may have the same common data as the start point, or may have different common data from the start point.
  • the path information extraction unit 5 extracts path information including the number of vertices, the name data of the included vertices, and the position on the path for each of those paths.
  • the path information extraction unit 5 can extract path information based on the undirected graph created by the graph creation unit 3 and the path identification information held by the derivative database 2 .
  • The path creation unit 6 extracts partial paths having common data as endpoints, that is, as the start point and the end point, from each path indicated by the path information extracted by the path information extraction unit 5. Then, based on the information held by the basic database 1, the path creation unit 6 enumerates all paths that have the same common data endpoints as each partial path and that have a prescribed length. For example, the path creation unit 6 can enumerate paths based on the undirected graph created by the graph creation unit 3 from the basic database 1.
  • The associating unit 7 searches for combinations of vertex name data between the partial paths extracted by the path creation unit 6 and the enumerated paths, for example, based on a character string similarity such as the edit distance. Then, the associating unit 7 associates the name data held by the basic database 1 with the name data held by the derivative database 2 based on the combinations found.
  • the data output unit 8 generates output information based on the result of association by the association unit 7 and outputs it.
  • the data output unit 8 can generate, as output information, a correspondence table representing the correspondence of name data based on the result of association by the association unit 7 .
  • Alternatively, the data output unit 8 may convert the name data of the information held in the basic database 1 based on the result of association by the association unit 7, create a new database, and use this as the output information.
  • The data output unit 8 may also integrate the information held by the basic database 1 and the derivative database 2 based on the result of association by the association unit 7, create a new database, and use this as the output information.
  • FIG. 2 is a diagram showing an example of the hardware configuration of the name data association device.
  • the name data association device is composed of a computer such as a server computer or a personal computer, and has a hardware processor 101 such as a CPU (Central Processing Unit).
  • In the name data association device, a program memory 102, a data memory 103, a communication interface 104, and an input/output interface (denoted as input/output IF in FIG. 2) 105 are connected to the processor 101.
  • The communication interface 104 can include, for example, one or more wired or wireless communication modules. If the basic database 1 and/or the derivative database 2 are configured in a data server or the like connected via a network such as a LAN (Local Area Network) or the Internet, the communication interface 104 can communicate with the data server or the like and retrieve data from it. The communication interface 104 can also communicate with an external data processing device or the like to receive a request from the data processing device and to return a data processing result corresponding to the request to the data processing device.
  • An input unit 107 and a display unit 108 are connected to the input/output interface 105 .
  • The input unit 107 and the display unit 108 can be, for example, a so-called tablet-type input/display device in which an input detection sheet adopting an electrostatic method or a pressure method is arranged on the display screen of a display device using liquid crystal or organic EL (Electro Luminescence). Note that the input unit 107 and the display unit 108 may be configured by independent devices.
  • the input/output interface 105 inputs operation information input from the input unit 107 to the processor 101 and displays display information generated by the processor 101 on the display unit 108 .
  • the input unit 107 and the display unit 108 do not have to be connected to the input/output interface 105 .
  • the input unit 107 and the display unit 108 are provided with a communication unit for connecting to the communication interface 104 directly or via a network, so that information can be exchanged with the processor 101 .
  • The input/output interface 105 may have a read/write function for a recording medium such as a semiconductor memory (for example, a flash memory), or may have a function of connecting to a reader/writer that has a read/write function for such a recording medium. As a result, a recording medium detachable from the name data association device can be used as a database for holding name data.
  • the input/output interface 105 may further have a connection function with other devices.
  • The program memory 102 is a non-transitory tangible computer-readable storage medium that combines, for example, a non-volatile memory that can be written and read at any time, such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), with a non-volatile memory such as a ROM.
  • The program memory 102 stores programs necessary for the processor 101 to execute various control processes according to one embodiment. That is, the processing function units of the above-described graph creation unit 3, common data extraction unit 4, path information extraction unit 5, path creation unit 6, association unit 7, and data output unit 8 can all be realized by causing the processor 101 to read and execute the programs stored in the program memory 102. Some or all of these processing function units may be implemented in various other forms, including integrated circuits such as application specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs).
  • the data memory 103 is a tangible computer-readable storage medium, for example, a combination of the above nonvolatile memory and a volatile memory such as RAM (Random Access Memory).
  • This data memory 103 is used to store various data acquired and created in the process of performing various processes. That is, in the data memory 103, an area for storing various data is appropriately secured in the process of performing various processes. As such areas, the data memory 103 can be provided with, for example, a basic database storage unit 1031 , a derived database storage unit 1032 , a temporary storage unit 1033 and an output information storage unit 1034 .
  • the basic database storage unit 1031 stores information of the basic database 1, and the derived database storage unit 1032 stores information of the derived database 2. That is, the basic database 1 and the derivative database 2 can be configured in the basic database storage unit 1031 and the derivative database storage unit 1032 .
  • FIG. 3 is a diagram showing an example of information held by the basic database 1 stored in the basic database storage unit 1031, and FIG. 4 is a diagram showing an example of information held by the derived database 2 stored in the derived database storage unit 1032. Here, an example is shown in which the name data is the name of a building. In the basic database 1 stored in the basic database storage unit 1031, the upper building and the lower building are adjacent to each other.
  • A combination of buildings having the same path identifier (identifier is abbreviated as ID in the figure) represents one path (Shinjuku Building → Minami-Shinjuku Building → Gaien Building → Yotsuya Building → Shinjuku Building).
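  • The grouping of rows by path identifier can be sketched as follows. Since FIG. 4 is not reproduced here, the row layout `(path_id, upper, lower)`, the identifier value `"P1"`, the function name `paths_from_rows`, and the assumption that the rows of one path are stored in traversal order are all hypothetical.

```python
from collections import defaultdict


def paths_from_rows(rows):
    """Rebuild one vertex sequence per path ID from (path_id, upper, lower) rows."""
    edges_by_path = defaultdict(list)
    for path_id, upper, lower in rows:
        edges_by_path[path_id].append((upper, lower))

    paths = {}
    for path_id, edges in edges_by_path.items():
        path = [edges[0][0]]
        for upper, lower in edges:
            # Assumption: rows are stored in traversal order, so each row
            # must start where the previous one ended.
            if upper != path[-1]:
                raise ValueError(f"rows of path {path_id!r} are not contiguous")
            path.append(lower)
        paths[path_id] = path
    return paths


rows = [
    ("P1", "Shinjuku Building", "Minami-Shinjuku Building"),
    ("P1", "Minami-Shinjuku Building", "Gaien Building"),
    ("P1", "Gaien Building", "Yotsuya Building"),
    ("P1", "Yotsuya Building", "Shinjuku Building"),
]
paths = paths_from_rows(rows)
print(" -> ".join(paths["P1"]))
```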
  • The building names in the derived database 2 are denoted by $c_i$ ($i \in \{1, 2, \ldots, n\}$).
  • The building names in the basic database 1 are denoted by $d_j$ ($j \in \{1, 2, \ldots, m\}$).
  • $n$ and $m$ are the numbers of building names in the respective databases.
  • The information stored in the basic database storage unit 1031 and the derived database storage unit 1032 is, for example, the information of the basic database 1 and the derived database 2 that is input from the input unit 107 and received by the processor 101 via the input/output interface 105. In this way, the basic database 1 and the derived database 2 can be constructed in the data memory 103.
  • all or part of the information held by the basic database 1 and the derived database 2 constructed in an external data server may be stored in the basic database storage unit 1031 and the derived database storage unit 1032 .
  • the processor 101 acquires information accumulated in the database server via the communication interface 104 in response to an instruction by a user operation from the input unit 107, and stores them in the storage units 1031 and 1032.
  • processor 101 may acquire information recorded on a recording medium via input/output interface 105 .
  • The processor 101 may also receive, from an external data processing device or the like via the communication interface 104, a request for name data association together with the information of the basic database 1 and the derived database 2, and store the received database information in the storage units 1031 and 1032 as the information to be processed.
  • The temporary storage unit 1033 stores the undirected graphs created when the processor 101 operates as the graph creation unit 3, the common data extracted when it operates as the common data extraction unit 4, the path information about all paths extracted when it operates as the path information extraction unit 5, the partial paths and enumerated paths obtained when it operates as the path creation unit 6, and the result of association of name data obtained when it operates as the association unit 7.
  • the output information storage unit 1034 stores output information obtained when the processor 101 operates as the data output unit 8 described above.
  • FIG. 5 is a flow chart showing an example of a processing operation related to association of name data in the name data association device.
  • the information of the basic database 1 is already stored in the basic database storage unit 1031 and the information of the derivative database 2 is already stored in the derivative database storage unit 1032 .
  • When an instruction to perform name data association is given from the input unit 107 via the input/output interface 105 or from an external data processing device via the communication interface 104, the processor 101 of the name data association device starts the operation shown in this flow chart.
  • The processor 101 performs the operation as the graph creation unit 3. That is, using the adjacency information in the information of the derived database 2 stored in the derived database storage unit 1032 and in the information of the basic database 1 stored in the basic database storage unit 1031, the processor 101 generates cycle graphs $G_c$ and $G_d$ with the name data as vertices (step S1).
  • The generated cycle graphs $G_c$ and $G_d$ are stored in the temporary storage unit 1033 of the data memory 103.
  • If the building names $c_i$ in the derived database 2 and the building names $d_j$ in the basic database 1 are each taken as vertices and adjacent vertices are interpreted as being connected by edges, the cycle graphs $G_c$ and $G_d$, which are the following undirected graphs, can be constructed.
  • A cycle is a subgraph of the cycle graph $G_c$ and indicates a path whose start point and end point are the same vertex.
  • $E_d$: a set of edges obtained from the adjacency information of the basic database 1. $g_d : E_d \to P(V_d)$: a mapping that associates a subset of the vertex set $V_d$ with each element of $E_d$.
  • $E_c$: a set of edges obtained from the adjacency information of the derivative database 2. $g_c : E_c \to P(V_c)$: a mapping that associates a subset of the vertex set $V_c$ with each element of $E_c$.
  • the processor 101 of the name data association device operates as the common data extraction unit 4 . That is, the processor 101 extracts common name data between the information of the basic database 1 stored in the basic database storage unit 1031 and the information of the derived database 2 stored in the derived database storage unit 1032 (step S2). The extracted common name data is stored in temporary storage section 1033 of data memory 103 .
  • the processor 101 executes the operation as the path information extractor 5.
  • Path information indicating each extracted path $\gamma_k$ is stored in the temporary storage unit 1033 of the data memory 103.
  • The path information can include the number of vertices of the extracted path $\gamma_k$, the name data of the included vertices, and their positions on the path.
  • I k the array of vertices included in the set S among the vertices forming the path ⁇ k , which are defined below.
  • I k : ( ⁇ k [i]
  • ⁇ k [i] ⁇ S, ⁇ k [i] ⁇ s k , i 1, 2, . . . ⁇ k
  • the processor 101 performs the operation as the path creating section 6.
  • $L_k^i$ is the partial path from vertex $l_k[i]$ to vertex $l_k[i+1]$ in the path $\gamma_k$.
  • $l_k[i]$ is the $i$-th element of the array $l_k$.
  • Based on the extracted partial paths, the processor 101 enumerates, among the paths in the cycle graph $G_d$ of the basic database 1 whose start point is $l_k[i]$ and whose end point is $l_k[i+1]$ ($i = 1, 2, \ldots$), all paths whose length is at least the length of the partial path $L_k^i$ and at most that length plus $x$ (step S5).
  • Here, $x$ is an integer greater than or equal to 0 specified by the user. Note that the same vertex and the same edge are not passed twice when enumerating these paths. Let the set of enumerated paths be $A_k^i$.
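  • The bounded enumeration of step S5 can be sketched as a depth-first search over simple paths. This is only an illustration under stated assumptions: the function name `enumerate_paths` is hypothetical, the two endpoints are assumed to be distinct, path length is counted in edges, and the sample graph contains only the edges that can be read off the two basic-database paths listed in the operation example rather than the full graph of FIG. 9.

```python
def enumerate_paths(graph, start, end, min_len, max_len):
    """List simple paths from start to end whose edge count is in [min_len, max_len]."""
    found = []

    def extend(path):
        length = len(path) - 1  # number of edges walked so far
        if path[-1] == end:
            if length >= min_len:
                found.append(tuple(path))
            return  # a simple path cannot continue through its end point
        if length == max_len:
            return
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in path:  # never visit a vertex twice
                extend(path + [neighbour])

    extend([start])
    return found


# Edges taken from the two paths listed in the operation example (assumed subset).
edges = [
    ("Kuwabara Building", "Fukui Fujita Building"),
    ("Fukui Fujita Building", "Fukuchi Yanagawa Building"),
    ("Fukuchi Yanagawa Building", "Hoshina Building"),
    ("Kuwabara Building", "Fukuyama Date Building"),
    ("Fukuyama Date Building", "Ritsukoyama Building"),
    ("Ritsukoyama Building", "Fukuoka Hanazono Building"),
    ("Fukuoka Hanazono Building", "Hoshina Building"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

x = 1  # user-specified slack
paths = enumerate_paths(graph, "Kuwabara Building", "Hoshina Building", 3, 3 + x)
for p in paths:
    print(len(p) - 1, p)
```

  • Because no vertex is revisited, no edge can be traversed twice either, which matches the constraint stated for step S5. Raising `x` widens the search and therefore the running time.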
  • The processor 101 operates as the associating unit 7. That is, if the set $A_k^i$ of paths enumerated in step S5 contains a path whose length equals the length of the partial path $L_k^i$, the processor 101 first sets that path to $\alpha$. Under this setting, combinations of names are selected as follows (step S6): $(L_k^i[j], \alpha[j])$, $j = 1, 2, \ldots$
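  • When an enumerated path has the same length as the partial path, the selection in step S6 amounts to pairing vertices by position. The sketch below assumes that only the interior vertices need pairing, since the endpoints are common data with identical notation; the function name `pair_by_position` is hypothetical, and the sample paths are taken from the operation example.

```python
def pair_by_position(partial_path, candidate_path):
    """Pair interior vertices of two equal-length paths index by index."""
    if len(partial_path) != len(candidate_path):
        raise ValueError("paths must have the same number of vertices")
    # Skip the first and last vertices: they are the shared common data.
    return list(zip(partial_path[1:-1], candidate_path[1:-1]))


partial = ("Kuwabara Building", "Fujita Building", "Yanagawa Building", "Hoshina Building")
same_length = (
    "Kuwabara Building",
    "Fukui Fujita Building",
    "Fukuchi Yanagawa Building",
    "Hoshina Building",
)
pairs = pair_by_position(partial, same_length)
for derived_name, basic_name in pairs:
    print(derived_name, "<->", basic_name)
```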
  • Then, a combination of name data is searched for and associated based on, for example, a character string similarity such as the edit distance, and the result is stored in the temporary storage unit 1033 of the data memory 103 (step S7).
  • Edit distance is disclosed, for example, in D. Gusfield. "Algorithms on strings, trees and sequences: computer science and computational biology.” Cambridge university press, 1997.
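  • For reference, the edit distance can be computed with the standard dynamic-programming recurrence. The following sketch uses the Levenshtein variant with unit costs for insertion, deletion, and substitution; the embodiment does not state which variant or cost model it uses, so that choice and the function name `edit_distance` are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

  • Only two rows of the table are kept at a time, so memory grows with the length of one string rather than with the product of both lengths.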
  • FIG. 6 is a schematic diagram for explaining a method of associating names.
  • name data of buildings BL d stored in the basic database 1 A building, B building, ... n building
  • names of buildings BL c stored in the derivative database 2 ⁇ building, ⁇ building, ... ⁇ building.
  • processor 101 can search for a combination of name data in the following procedure.
  • Name data for which there is only one combination is output as it is, and for the other names, name data for which an output result has already been obtained is excluded from the candidates.
  • Consistency is maintained as follows. Suppose that there are multiple candidate names for a certain name A and that the candidates include a name B excluded by the above operation. Consider the path P that served as the basis for outputting the combination (A, B) of the name A and the excluded name B. If this path P also gives a name combination (C, D) for a name C different from the name A, the combination (C, D) is excluded as well, because the combination (A, B) has been excluded.
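  • The first part of this procedure, outputting names that have a single candidate and removing already-output names from the remaining candidate lists, can be sketched as a simple fixed-point loop. The path-based consistency check is not modeled here. The function name `resolve_candidates` and the sample names are hypothetical.

```python
def resolve_candidates(candidates):
    """Repeatedly fix names with exactly one candidate and prune that candidate elsewhere."""
    remaining = {name: set(options) for name, options in candidates.items()}
    resolved = {}
    progress = True
    while progress:
        progress = False
        for name, options in list(remaining.items()):
            if len(options) == 1:
                (match,) = options
                resolved[name] = match
                del remaining[name]
                # A name that has been output is no longer a candidate for others.
                for other_options in remaining.values():
                    other_options.discard(match)
                progress = True
    return resolved, remaining


# Hypothetical candidate lists for three names.
candidates = {
    "alpha": {"A"},
    "beta": {"A", "B"},
    "gamma": {"B", "C"},
}
resolved, unresolved = resolve_candidates(candidates)
print(resolved, unresolved)
```

  • Names whose candidate set never shrinks to one element stay in the second return value, so they can be handed to a later step or to a human reviewer.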
  • a more specific example will be described later as an operation example.
  • The processor 101 determines whether or not all of the paths $\gamma_k$ have been processed based on the path information extracted in step S3 (step S8). That is, it is determined whether the processing has been completed for all vertices of all paths $\gamma_k$. If it is determined that there is a path $\gamma_k$ that has not yet been processed, the processor 101 updates $k$, shifts to the process of step S4, and repeats the processes of steps S4 to S7.
  • When it is determined that all paths have been processed, the processor 101 operates as the data output unit 8 to output the name data association information (step S9). That is, the processor 101 generates output information, in a form instructed from the input unit 107 or from an external data processing device, from the association results stored in the temporary storage unit 1033 of the data memory 103, and stores the generated output information in the output information storage unit 1034 of the data memory 103. The processor 101 can display this output information on the display unit 108 via the input/output interface 105, or can transmit it to an external data processing device via the communication interface 104.
  • As described above, the path creation unit 6 extracts partial paths having common data as endpoints and non-common data as vertices between the endpoints, and creates, for each partial path, paths that have the same common data endpoints and a length equal to or greater than the length of the partial path.
  • The associating unit 7 then associates the name data held by the basic database 1 with the name data held by the derived database 2.
  • As a result, synonymous name data that has notational variations between the databases to be integrated can be accurately associated without human intervention, even if the character string data corresponding to the name data have no correspondence between the databases. Therefore, information on a given matter can be collected from different databases without omission.
  • the effect of improving work efficiency can be expected by reducing human operations.
  • The graph creation unit 3 creates cycle graphs $G_d$ and $G_c$, which are undirected graphs of the basic database 1 and the derivative database 2 with the name data as vertices.
  • The path information extracting unit 5 generates all paths $\gamma_k$ whose endpoints are the common data and whose vertices are the name data held in the derived database 2, and for each of these paths $\gamma_k$ extracts path information including the number of vertices, the name data of the included vertices, and their positions on the path.
  • The path creation unit 6 extracts partial paths from the cycle graph $G_c$ based on the path information, and creates paths from the cycle graph $G_d$ for each of these partial paths.
  • Therefore, paths can be created that exclude vertices that have no possibility of being associated with the name data.
  • The path creation unit 6 creates paths whose number of vertices is equal to or greater than the number of vertices of the path $\gamma_k$ and equal to or less than the number of vertices specified by the user. Limiting the number of vertices included in a path in this way shortens the processing time.
  • For each vertex on a path created by the path creation unit 6, when its position on the path corresponds to a vertex on the partial path, the associating unit 7 associates the name data of that vertex among the name data held by the basic database 1 with the name data of the corresponding vertex on the partial path among the name data held by the derivative database 2. When the position on the path does not correspond to a vertex on the partial path, the associating unit 7 associates, based on the character string similarity between the name data, the name data of the vertex on the path among the name data held by the basic database 1 with the name data of a vertex on the partial path among the name data held by the derivative database 2. Therefore, the name data held by the basic database 1 can be easily associated with the name data held by the derivative database 2.
  • The name data association device repeats the processing of the path creation unit 6 and the association unit 7 until the processing for all paths $\gamma_k$ generated by the path information extraction unit 5 is completed. Therefore, it is possible to reduce the probability that the name data held by the derivative database 2 fails to be associated with the name data held by the basic database 1.
  • the name data association device uses the data output unit 8 to generate output information including a name data correspondence table based on the result of name data association. Therefore, by using this output information, it is possible to perform database integration processing. Further, the name data association device according to one embodiment may generate information of the integrated database as the output information.
  • FIG. 7 is a diagram showing an example of information held by the basic database 1 stored in the basic database storage unit 1031 in the operation example. The adjacency information obtained from this basic database is as follows.
  • the notation (A, B) indicates that data name A and data name B are connected.
  • $V_c$ = {Fukuoka Hanazono Building, Tatsukoyama Building, Fukuyama Date Building, Kuwabara Building, Fukui Fujita Building, Fukuchi Yanagawa Building, Hoshina Building, Osorezan Building, Tsukidate Building, Fukushima Kawamata Building}
  • $V_d$ = {Hanazono Building, Date Building, Kuwabara Building, Fujita Building, Yanagawa Building, Hoshina Building, Osorezan Building, Tsukikan Building, Kawamata Building}
  • The correct combinations of the name data, that is, the correct association of the name data, are as follows. {(Tsukidate Building, Tsukikan Building), (Fukushima Kawamata Building, Kawamata Building), (Fukuoka Hanazono Building, Hanazono Building), (Fukuyama Date Building, Date Building), (Fukui Fujita Building, Fujita Building), (Fukuchi Yanagawa Building, Yanagawa Building)}
  • In step S1, the processor 101 of the name data association device operates as the graph creation unit 3 to create cycle graphs.
  • FIG. 9 is a schematic diagram showing an example of the cycle graph G_d created from the information held by the basic database 1 in the operation example.
  • In step S2, the processor 101 operates as the common data extraction unit 4 to extract the name data common to the cycle graph G_c and the cycle graph G_d.
  • In step S3, the processor 101 operates as the path information extraction unit 5 to extract path information from the derivative database 2, and in step S4 it operates as the path creation unit 6 to extract partial paths.
  • FIG. 10 is a schematic diagram showing an example of path 1 generated from the cycle graph G_c created from the information held by the derivative database 2 in the operation example.
  • The processor 101 extracts, from the cycle graph G_c, partial paths whose endpoints are elements of the building name set S.
  • L_{1,1} = (Kuwabara Building, Fujita Building, Yanagawa Building, Hoshina Building)
  • L_{1,2} = (Hoshina Building, Osorezan Building)
  • L_{1,3} = (Osorezan Building, Tsukikan Building, Kawamata Building, Hanazono Building, Date Building, Kuwabara Building)
  • In step S5, for the partial path L_{1,1}, the processor 101 enumerates the paths on the cycle graph G_d that have "Kuwabara Building" and "Hoshina Building" as endpoints and a length of at least 3 and at most 3+x.
  • Length 3: (Kuwabara Building, Fukui Fujita Building, Fukuchi Yanagawa Building, Hoshina Building)
  • Length 4: (Kuwabara Building, Fukuyama Date Building, Tatsukoyama Building, Fukuoka Hanazono Building, Hoshina Building)
  • Every combination has an edit distance of 1, so the following candidates can be considered. Candidates for "Fujita Building": "Fukuyama Date Building", "Tatsukoyama Building", "Fukui Fujita Building". Candidates for "Yanagawa Building": "Fukuoka Hanazono Building", "Tatsukoyama Building", "Fukuchi Yanagawa Building".
  • The partial path L_{1,2} is omitted because its length is 1.
  • FIG. 11 is a diagram showing an example of the output information stored in the output information storage unit 1034. Although the output information is shown here as a correspondence table showing the correspondence of name data, it is of course not limited to this.
  • The case where there are two target databases has been described as an example, but there may be three or more. That is, if at least one of three or more databases holds path identification information, name data can be associated with the remaining two or more databases.
  • The processor 101 may access an external data server through the communication interface 104 as needed, proceed with the processing using the information accumulated in the basic database 1 and the derivative database 2 constructed there, and store only the processing results of each step in the temporary storage unit 1033.
  • This keeps down the capacity required of the data memory 103 in the name data association device, so the device can be configured at low cost.
  • The method described in each embodiment can be stored, as a program (software means) executable by a computer, on a recording medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disc (CD-ROM, DVD, MO, etc.), or a semiconductor memory (ROM, RAM, flash memory, etc.), or can be transmitted and distributed via a communication medium.
  • the programs stored on the medium also include a setting program for configuring software means (including not only execution programs but also tables and data structures) to be executed by the computer.
  • a computer that realizes this apparatus reads a program recorded on a recording medium, and in some cases, builds software means by a setting program, and executes the above-described processes by controlling the operation by this software means.
  • the term "recording medium” as used herein is not limited to those for distribution, and includes storage media such as magnetic disks, semiconductor memories, etc. provided in computers or devices connected via a network.
  • the present invention is not limited to the above embodiments, and can be modified in various ways without departing from the gist of the invention at the implementation stage.
  • The embodiments may be combined as appropriate wherever possible, in which case the combined effects are obtained.
  • the above-described embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining a plurality of disclosed constituent elements.

Abstract

A name data association device according to one embodiment includes: a common data extraction unit that extracts, as common data, name data with the same representation between a first database (DB) and a second DB, the first DB holding a plurality of pieces of name data and adjacent information indicating adjacent relationships between the name data, the second DB holding the plurality of pieces of name data, adjacent information between the name data, and path identification information representing paths to which these name data belong; a path generation unit that extracts partial paths from a path represented by the path identification information held by the second DB, the partial paths having the common data as endpoints and having non-common data as a vertex between the endpoints, and generates, on the basis of the information held by the first DB, for each of the partial paths, a path having the same common data endpoints as the endpoints of the partial paths and a length equal to or larger than that of the partial paths; and an association unit that searches a combination of the vertexes on the partial paths and a vertex on the path for each of the partial paths to thereby associate the name data between the first DB and the second DB.

Description

Name data association device, name data association method, and name data association program
Embodiments of the present invention relate to a name data association device, a name data association method, and a name data association program.
Work that uses databases sometimes integrates databases under different management and uses the stored name data side by side to perform more multifaceted and comprehensive analysis. This requires so-called "name matching": integrating name data by, for example, assigning the same identification information to name data that represents the same matter across the databases to be integrated.
However, how name data is entered depends on whoever manages each database. As a result, the databases to be integrated often write the same matter differently (notation variation). If databases containing notation variations are integrated as they are, the analysis described above lacks the information related to a given matter for exactly the portion where the notation varies.
As techniques for coping with such notation variations, Non-Patent Documents 1 and 2 propose methods that search for the most similar character string by quantitatively calculating the similarity between the character strings being searched. Non-Patent Document 3 proposes a method of accurately and efficiently finding character strings that represent the same matter by creating a search dictionary. Patent Document 1 discloses a linking method that uses information surrounding the data to be name-matched.
Japanese Patent Application Laid-Open No. 2020-123210
Notation varies in two ways: notation that abbreviates the registered data name, and notation that uses a name (common name) based on local rules among users.
Methods such as those disclosed in Non-Patent Documents 1 and 2 are popular and effective when only the former, abbreviated notation, is present as notation variation. However, when the latter, common-name notation, is mixed in, each common name is linked to a name that merely resembles it as a character string, so an incorrect result is likely. This is because a common name often differs markedly from the name it should be linked to.
Moreover, even when only abbreviated notation is handled, the methods disclosed in Non-Patent Documents 1 and 2 were designed on the assumption that they are used for Japanese, so their scope of application is limited. The characteristics of abbreviation in Japanese do not all match those in other languages, so these methods cannot necessarily be applied without problems to name data entered in other languages.
Creating a dictionary, as disclosed in Non-Patent Document 3, is therefore considered the best approach to common-name notation. However, as the number of databases to be integrated grows, the dictionary must be extended accordingly, so it takes time before notation variations can be handled.
Patent Document 1 therefore proposes a technique that enables name matching of common names without relying on a dictionary, using information surrounding the data to be matched (for example, that data A and data B in the same database are connected). However, the technique disclosed in Patent Document 1 requires that the graphs that can be constructed from the name data of the respective databases satisfy a kind of inclusion relationship (for every edge of one graph, a corresponding edge always exists in the other graph). Consequently, name data that yields graphs whose structures do not maintain this inclusion relationship is difficult to match, or, even when it can be matched, produces a large number of candidate names.
This invention aims to provide a technique that can accurately associate synonymous name data whose notation varies between the databases to be integrated, without human effort.
To solve the above problem, a name data association device according to one aspect of the present invention associates synonymous name data having different notations between a first database, which holds a plurality of name data and adjacency information indicating logical or physical adjacency relationships between the name data, and a second database, which holds a plurality of name data, adjacency information of the name data, and path identification information representing the paths to which the name data belong. The device comprises a common data extraction unit, a path creation unit, and an association unit. The common data extraction unit extracts, as common data, name data that has the same notation in the first database and the second database. The path creation unit extracts, from the paths represented by the path identification information held by the second database, partial paths that have common data extracted by the common data extraction unit as endpoints and non-common data as the vertices between the endpoints. Based on the information held by the first database, it then creates, for each partial path, paths that have the same common data as endpoints and a length equal to or greater than that of the partial path. For each partial path extracted by the path creation unit, the association unit searches for combinations of the vertices on the partial path and the vertices on the paths created by the path creation unit, thereby associating the name data held by the first database with the name data held by the second database.
According to one aspect of the present invention, it is possible to provide a technique that can accurately associate synonymous name data whose notation varies between the databases to be integrated, without human effort.
FIG. 1 is a block diagram showing an example of the configuration of a name data association device according to one embodiment of the present invention.
FIG. 2 is a diagram showing an example of the hardware configuration of the name data association device.
FIG. 3 is a diagram showing an example of information held by the basic database stored in the basic database storage unit.
FIG. 4 is a diagram showing an example of information held by the derivative database stored in the derivative database storage unit.
FIG. 5 is a flowchart showing an example of the processing operations related to the association of name data in the name data association device.
FIG. 6 is a schematic diagram for explaining the method of associating names.
FIG. 7 is a diagram showing an example of information held by the basic database in the operation example.
FIG. 8 is a diagram showing an example of information held by the derivative database in the operation example.
FIG. 9 is a schematic diagram showing an example of the cycle graph created by the graph creation unit from the information held by the basic database in the operation example.
FIG. 10 is a schematic diagram showing an example of a path generated from the cycle graph created by the graph creation unit from the information held by the derivative database in the operation example.
FIG. 11 is a diagram showing an example of the output information stored in the output information storage unit in the operation example.
Embodiments of the present invention are described below with reference to the drawings.
In this embodiment, it is assumed that a plurality of databases hold synonymous name data with different notations, and that the data columns whose name data are to be associated across these databases are known. Each data column can contain name data together with other data corresponding to the name data, such as measured values and measurement dates and times, or sales dates and times and sales amounts. Each database is also assumed to hold logical or physical adjacency information indicating the adjacency relationships of the name data. Here, adjacency information refers to information on how data items are connected, such as personal connections (person A and person B are acquaintances) or connection relationships on a network (building A and building B are connected by a cable). The present invention places no particular limit on the number of databases whose name data are associated, but for simplicity this embodiment assumes two target databases. It is also assumed that the name data in each database have connection relationships on a network. Specifically, each database has columns named "upper building" and "lower building", and the name data stored in "upper building" and the name data stored in "lower building" are adjacent on some network. In addition, it is assumed that at least one of the databases holds, besides the adjacency information, path identification information representing the paths to which the name data belong.
(Configuration example)
FIG. 1 is a block diagram showing an example of the configuration of a name data association device according to one embodiment of the present invention. The name data association device includes a basic database 1 (in the figure, database is abbreviated as DB), a derivative database 2, a graph creation unit 3, a common data extraction unit 4, a path information extraction unit 5, a path creation unit 6, an association unit 7, and a data output unit 8.
The basic database 1 is a first database that holds a plurality of name data and adjacency information indicating the adjacency relationships between the name data. The derivative database 2 is a second database that holds a plurality of name data, adjacency information of the name data, and path identification information representing the paths to which the name data belong.
Based on the information held by the basic database 1 and the derivative database 2, the graph creation unit 3 creates undirected graphs with name data as vertices.
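As a minimal sketch of what the graph creation unit 3 might do, the following builds an undirected adjacency map from ("upper building", "lower building") rows. The row layout, the function name `build_undirected_graph`, and the sample rows are assumptions for illustration; the building names are borrowed from the path example in FIG. 4.

```python
from collections import defaultdict


def build_undirected_graph(rows):
    """Build an undirected graph (name -> set of adjacent names) from
    (upper building, lower building) rows. The row layout is an assumption."""
    graph = defaultdict(set)
    for upper, lower in rows:
        # Adjacency is symmetric, so record the edge in both directions.
        graph[upper].add(lower)
        graph[lower].add(upper)
    return dict(graph)


# Hypothetical rows; only the names come from the document.
rows = [
    ("Shinjuku Building", "Minami-Shinjuku Building"),
    ("Minami-Shinjuku Building", "Gaien Building"),
    ("Gaien Building", "Yotsuya Building"),
    ("Yotsuya Building", "Shinjuku Building"),
]
graph = build_undirected_graph(rows)
```

A dictionary of sets keeps neighbor lookups cheap and ignores duplicate rows, which seems sufficient for the path enumeration described later.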
The common data extraction unit 4 extracts, as common data, name data that has the same notation in the basic database 1 and the derivative database 2.
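Extracting common data amounts to a set intersection on exact notation. The sketch below uses the building names listed in the operation example; assigning the longer names to the basic database and the shorter names to the derivative database follows the paths enumerated in that example, and the variable and function names are hypothetical.

```python
# Building names from the operation example. Which database holds which set is
# inferred from the paths listed there (long names on G_d, the basic database).
basic_names = {
    "Fukuoka Hanazono Building", "Tatsukoyama Building", "Fukuyama Date Building",
    "Kuwabara Building", "Fukui Fujita Building", "Fukuchi Yanagawa Building",
    "Hoshina Building", "Osorezan Building", "Tsukidate Building",
    "Fukushima Kawamata Building",
}
derivative_names = {
    "Hanazono Building", "Date Building", "Kuwabara Building", "Fujita Building",
    "Yanagawa Building", "Hoshina Building", "Osorezan Building",
    "Tsukikan Building", "Kawamata Building",
}


def extract_common(names_a, names_b):
    """Return the names whose notation is identical in both databases."""
    return set(names_a) & set(names_b)


common = extract_common(basic_names, derivative_names)
print(sorted(common))
```

The three names returned here are the endpoints of the partial paths in the operation example.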
Based on the path identification information held by the derivative database 2, the path information extraction unit 5 generates all paths that start at one of the common data extracted by the common data extraction unit 4 and whose vertices are name data held by the derivative database 2. The end point of a path may be the same common data as its start point or different common data. For each of these paths, the path information extraction unit 5 extracts path information including the number of vertices, the name data of the vertices on the path, and their positions on the path. For example, the path information extraction unit 5 can extract the path information based on the undirected graph created by the graph creation unit 3 and the path identification information held by the derivative database 2.
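One way to picture this step is to chain the rows that share a path identifier into a vertex sequence and then record each vertex with its position. The sketch below assumes rows of the form (path ID, upper building, lower building); that layout, the function names, and the sample rows are hypothetical, with building names taken from the FIG. 4 example.

```python
def rows_to_path(rows, path_id, start):
    """Chain the rows of one path ID into a vertex sequence beginning at start.

    When several rows touch the current vertex, the first one in row order is
    followed, so the row order decides the traversal direction."""
    remaining = [(upper, lower) for pid, upper, lower in rows if pid == path_id]
    path = [start]
    while remaining:
        for edge in remaining:
            if path[-1] in edge:
                upper, lower = edge
                path.append(lower if upper == path[-1] else upper)
                remaining.remove(edge)
                break
        else:
            raise ValueError("rows do not form a single path from the start vertex")
    return path


def path_info(path):
    """Summarize a path: number of entries and each name with its position."""
    return {"vertex_count": len(path), "positions": list(enumerate(path))}


# Hypothetical rows for path ID 1.
rows = [
    (1, "Shinjuku Building", "Minami-Shinjuku Building"),
    (1, "Minami-Shinjuku Building", "Gaien Building"),
    (1, "Gaien Building", "Yotsuya Building"),
    (1, "Yotsuya Building", "Shinjuku Building"),
]
path = rows_to_path(rows, 1, "Shinjuku Building")
info = path_info(path)
```

For a closed path the start vertex appears twice in the sequence, so `vertex_count` here counts entries rather than distinct names.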
For each path indicated by the path information extracted by the path information extraction unit 5, the path creation unit 6 extracts partial paths whose endpoints, that is, whose start and end points, are common data. Then, based on the information held by the basic database 1, the path creation unit 6 enumerates all paths that have the same common data as each partial path as endpoints and that have a prescribed length. For example, the path creation unit 6 can enumerate the paths based on the undirected graph that the graph creation unit 3 created from the basic database 1.
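The two tasks of the path creation unit 6 can be sketched with the data of the operation example. The full path below is reconstructed by concatenating the three partial paths listed there, and the graph contains only the edges implied by the two enumerated paths, so both are assumptions rather than the contents of FIG. 7 to FIG. 10. The function names and the value x = 1 are also hypothetical.

```python
def split_at_common(path, common):
    """Cut a path that starts at a common name into partial paths whose
    endpoints are common names and whose interior vertices are not."""
    partial_paths, start = [], 0
    for i in range(1, len(path)):
        if path[i] in common:
            partial_paths.append(tuple(path[start:i + 1]))
            start = i
    return partial_paths


def enumerate_paths(graph, start, goal, min_len, max_len):
    """Enumerate simple paths between two distinct vertices whose number of
    edges lies between min_len and max_len, using depth-first search."""
    found = []

    def dfs(path):
        if path[-1] == goal:
            if len(path) - 1 >= min_len:
                found.append(tuple(path))
            return
        if len(path) - 1 == max_len:
            return
        for nxt in sorted(graph[path[-1]]):
            if nxt not in path:
                dfs(path + [nxt])

    dfs([start])
    return found


common = {"Kuwabara Building", "Hoshina Building", "Osorezan Building"}
# Assumed full path in the derivative database (concatenated partial paths).
path_1 = [
    "Kuwabara Building", "Fujita Building", "Yanagawa Building", "Hoshina Building",
    "Osorezan Building", "Tsukikan Building", "Kawamata Building",
    "Hanazono Building", "Date Building", "Kuwabara Building",
]
partial_paths = split_at_common(path_1, common)

# Assumed fragment of the basic-database graph G_d.
edges = [
    ("Kuwabara Building", "Fukui Fujita Building"),
    ("Fukui Fujita Building", "Fukuchi Yanagawa Building"),
    ("Fukuchi Yanagawa Building", "Hoshina Building"),
    ("Kuwabara Building", "Fukuyama Date Building"),
    ("Fukuyama Date Building", "Tatsukoyama Building"),
    ("Tatsukoyama Building", "Fukuoka Hanazono Building"),
    ("Fukuoka Hanazono Building", "Hoshina Building"),
]
g_d = {}
for a, b in edges:
    g_d.setdefault(a, set()).add(b)
    g_d.setdefault(b, set()).add(a)

first = partial_paths[0]
x = 1  # assumed slack on the path length
candidates = enumerate_paths(g_d, first[0], first[-1], len(first) - 1, len(first) - 1 + x)
```

Allowing paths longer than the partial path leaves room for vertices that exist only in the basic database, at the cost of more candidates to examine.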
The association unit 7 searches for combinations of vertex name data between the partial paths extracted by the path creation unit 6 and the paths it enumerated, based on, for example, a character string similarity such as edit distance. The association unit 7 then associates the name data held by the basic database 1 with the name data held by the derivative database 2 based on the combinations found.
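The search for combinations could look like the following sketch. It pairs the interior vertices of a partial path with interior vertices of each candidate path while preserving their order along the path, and scores each pairing by the summed character-level Levenshtein distance. The order-preserving pairing and the summed score are assumptions chosen for illustration, not a scoring rule stated in the source; `best_assignment` is a hypothetical name.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def best_assignment(partial_path, candidate_paths):
    """Pick the order-preserving pairing of interior vertices with the smallest
    total edit distance. Returns (mapping, total cost)."""
    interior = partial_path[1:-1]
    best, best_cost = None, None
    for path in candidate_paths:
        # combinations() keeps the original order, so pairings never cross.
        for chosen in combinations(path[1:-1], len(interior)):
            cost = sum(edit_distance(s, t) for s, t in zip(interior, chosen))
            if best_cost is None or cost < best_cost:
                best, best_cost = dict(zip(interior, chosen)), cost
    return best, best_cost


partial = ("Kuwabara Building", "Fujita Building", "Yanagawa Building", "Hoshina Building")
candidate_paths = [
    ("Kuwabara Building", "Fukui Fujita Building", "Fukuchi Yanagawa Building",
     "Hoshina Building"),
    ("Kuwabara Building", "Fukuyama Date Building", "Tatsukoyama Building",
     "Fukuoka Hanazono Building", "Hoshina Building"),
]
mapping, cost = best_assignment(partial, candidate_paths)
```

On this data the selected pairing agrees with the correct associations given in the operation example.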
The data output unit 8 generates output information based on the association result from the association unit 7 and outputs it. For example, the data output unit 8 can generate, as output information, a correspondence table representing the correspondence of name data. Alternatively, the data output unit 8 may convert the name data in the information held by the basic database 1 based on the association result to create a new database, and use this as the output information. The data output unit 8 may also integrate the information held by the basic database 1 and the derivative database 2 based on the association result to create a new database, and use this as the output information.
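A correspondence table can be represented as a simple mapping and applied to rows of the basic database to convert its names. The pairs below are the correct associations listed in the operation example; the function name `convert_names` and the sample rows are hypothetical.

```python
# Basic-database name -> derivative-database name, from the operation example.
correspondence = {
    "Tsukidate Building": "Tsukikan Building",
    "Fukushima Kawamata Building": "Kawamata Building",
    "Fukuoka Hanazono Building": "Hanazono Building",
    "Fukuyama Date Building": "Date Building",
    "Fukui Fujita Building": "Fujita Building",
    "Fukuchi Yanagawa Building": "Yanagawa Building",
}


def convert_names(rows, table):
    """Rewrite every name in the rows using the table; names that are not in
    the table (for example, common data) are kept unchanged."""
    return [tuple(table.get(name, name) for name in row) for row in rows]


# Hypothetical (upper building, lower building) rows of the basic database.
basic_rows = [
    ("Kuwabara Building", "Fukui Fujita Building"),
    ("Fukui Fujita Building", "Fukuchi Yanagawa Building"),
]
converted = convert_names(basic_rows, correspondence)
```

Once both databases use one notation, their rows can be merged on the name columns.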
FIG. 2 is a diagram showing an example of the hardware configuration of the name data association device.
As shown in FIG. 2, the name data association device is configured by a computer such as a server computer or a personal computer and has a hardware processor 101 such as a CPU (Central Processing Unit). In the name data association device, a program memory 102, a data memory 103, a communication interface 104, and an input/output interface (written as input/output IF in FIG. 2) 105 are connected to the processor 101 via a bus 106.
The communication interface 104 can include, for example, one or more wired or wireless communication modules. When the basic database 1 and/or the derivative database 2 are configured on a data server or the like connected via a network such as a LAN (Local Area Network) or the Internet, the communication interface 104 can communicate with that data server and acquire data from it. The communication interface 104 can also communicate with an external data processing device or the like, receive requests from it, and return the data processing results for those requests to it.
An input unit 107 and a display unit 108 are connected to the input/output interface 105. The input unit 107 and the display unit 108 can be a so-called tablet-type input/display device in which an input detection sheet using an electrostatic or pressure method is placed on the display screen of a display device using, for example, liquid crystal or organic EL (Electro Luminescence). The input unit 107 and the display unit 108 may also be independent devices. The input/output interface 105 passes operation information entered through the input unit 107 to the processor 101 and causes the display unit 108 to display the display information generated by the processor 101.
The input unit 107 and the display unit 108 need not be connected to the input/output interface 105. If they include a communication unit for connecting to the communication interface 104 directly or via a network, they can exchange information with the processor 101.
The input/output interface 105 may also have a read/write function for a recording medium such as a semiconductor memory (for example, a flash memory), or a function for connecting to a reader/writer that has such a read/write function. This allows a recording medium that is detachable from the name data association device to serve as a database holding name data. The input/output interface 105 may further have a function for connecting to other devices.
The program memory 102 is a non-transitory tangible computer-readable storage medium that combines, for example, a nonvolatile memory that can be written and read at any time, such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), with a nonvolatile memory such as a ROM. The program memory 102 stores the programs the processor 101 needs to execute the various control processes according to the embodiment. That is, the processing functions of the graph creation unit 3, the common data extraction unit 4, the path information extraction unit 5, the path creation unit 6, the association unit 7, and the data output unit 8 can each be realized by having the processor 101 read and execute a program stored in the program memory 102. Some or all of these processing functions may instead be realized in various other forms, including integrated circuits such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The data memory 103 is a tangible computer-readable storage medium that combines, for example, the nonvolatile memory described above with a volatile memory such as a RAM (Random Access Memory). The data memory 103 is used to store various data acquired and created in the course of the various processes. That is, areas for storing such data are secured in the data memory 103 as needed while the processes run. As such areas, the data memory 103 can be provided with, for example, a basic database storage unit 1031, a derivative database storage unit 1032, a temporary storage unit 1033, and an output information storage unit 1034.
The basic database storage unit 1031 stores the information of the basic database 1, and the derivative database storage unit 1032 stores the information of the derivative database 2. That is, the basic database 1 and the derivative database 2 can be configured in the basic database storage unit 1031 and the derivative database storage unit 1032, respectively.
 図3は、基礎データベース記憶部1031に記憶される基礎データベース1が保持する情報の一例を示す図であり、図4は、派生データベース記憶部1032に記憶される派生データベース2が保持する情報の一例を示す図である。ここでは、名称データがビルの名称である例を示す。基礎データベース記憶部1031に記憶される基礎データベース1では、上位ビルと下位ビルは、隣接関係にある。派生データベース記憶部1032に記憶される派生データベース2では、同一のパス識別子(図では、識別子をIDと略記する)を持つビルの組み合わせが、1つのパス(新宿ビル→南新宿ビル→外苑ビル→四ツ谷ビル→新宿ビル)を構成している。以降、派生データベース2にあるビル名をci(i∈{1,2,…,n})で表し、基礎データベース1にあるビル名をdj(j∈{1,2,…,m})で表す。ここで、n及びmは、それぞれのデータベースにおけるビル名数である。 FIG. 3 is a diagram showing an example of information held by the basic database 1 stored in the basic database storage unit 1031, and FIG. 4 is an example of information held by the derived database 2 stored in the derived database storage unit 1032. It is a figure which shows. Here, an example is shown in which the name data is the name of a building. In the basic database 1 stored in the basic database storage unit 1031, the upper building and the lower building are adjacent to each other. In the derived database 2 stored in the derived database storage unit 1032, a combination of buildings having the same path identifier (identifier is abbreviated as ID in the figure) is represented by one path (Shinjuku Building → Minami-Shinjuku Building → Gaien Building → Yotsuya Building → Shinjuku Building). Hereinafter, the building names in the derived database 2 are denoted by c i (i ∈ {1, 2, ..., n}), and the building names in the basic database 1 are denoted by d j (j ∈ {1, 2, ..., m} ). where n and m are the number of building names in each database.
The information stored in the basic database storage unit 1031 and the derived database storage unit 1032 can be, for example, the information of the basic database 1 and the derived database 2 that is entered from the input unit 107 and received by the processor 101 via the input/output interface 105. That is, the basic database 1 and the derived database 2 can be constructed in the data memory 103. Alternatively, all or part of the information held by a basic database 1 and a derived database 2 constructed on an external data server may be stored in the basic database storage unit 1031 and the derived database storage unit 1032. In this case, for example, the processor 101 acquires the information accumulated in the database server via the communication interface 104 in response to a user instruction from the input unit 107, and stores it in the storage units 1031 and 1032. The processor 101 may also acquire information recorded on a recording medium via the input/output interface 105. Furthermore, the processor 101 may receive, via the communication interface 104, the information of the basic database 1 and the derived database 2 together with a name data association request from an external data processing device or the like, and store the received database information in the storage units 1031 and 1032 as the information to be processed.
The temporary storage unit 1033 stores, among other things: the undirected graphs created when the processor 101 operates as the graph creation unit 3; the common data extracted when it operates as the common data extraction unit 4; the path information on all paths extracted when it operates as the path information extraction unit 5; the partial paths extracted and the paths enumerated when it operates as the path creation unit 6; and the name data association results obtained when it operates as the association unit 7.
The output information storage unit 1034 stores the output information obtained when the processor 101 operates as the data output unit 8.
(Operation)
Next, the operation of the name data association device will be described.
FIG. 5 is a flowchart showing an example of the processing performed by the name data association device to associate name data. It is assumed here that the information of the basic database 1 is already stored in the basic database storage unit 1031 and the information of the derived database 2 is already stored in the derived database storage unit 1032. When instructed to perform name data association, either from the input unit 107 via the input/output interface 105 or from an external data processing device via the communication interface 104, the processor 101 of the name data association device starts the operation shown in this flowchart.
First, the processor 101 operates as the graph creation unit 3. That is, for the information of the derived database 2 stored in the derived database storage unit 1032 and the information of the basic database 1 stored in the basic database storage unit 1031, the processor 101 uses the adjacency information to generate the cycle graphs $G_c$ and $G_d$, whose vertices are the name data (step S1). The generated cycle graphs $G_c$ and $G_d$ are stored in the temporary storage unit 1033 of the data memory 103.
If each building name $c_i$ in the derived database 2 and each building name $d_j$ in the basic database 1 is taken as a vertex, and vertices in an adjacency relationship are regarded as connected by an edge, the cycle graphs $G_c$ and $G_d$, which are undirected graphs, can be constructed as follows. Here, a cycle is a subgraph of the cycle graph $G_c$, namely a path whose start point and end point are the same vertex.
[Equation JPOXMLDOC01-appb-M000001]
   $E_d$: the set of edges obtained from the adjacency information of the basic database 1
   $g_d: E_d \to P(V_d)$: a mapping that associates each element of $E_d$ with a subset of the vertex set $V_d$, where $P(V_d)$ is the power set of $V_d$
      $G_d := (g_d, V_d, E_d)$
[Equation JPOXMLDOC01-appb-M000002]
   $E_c$: the set of edges obtained from the adjacency information of the derived database 2
   $g_c: E_c \to P(V_c)$: a mapping that associates each element of $E_c$ with a subset of the vertex set $V_c$, where $P(V_c)$ is the power set of $V_c$
      $G_c := (g_c, V_c, E_c)$
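As a rough illustration of step S1, the following minimal sketch builds such an undirected graph from a list of adjacency pairs. The function name `build_graph`, the dictionary-of-sets representation, and the sample names A, B, and C are assumptions made for illustration and do not appear in the embodiment.

```python
from collections import defaultdict


def build_graph(adjacent_pairs):
    """Build an undirected graph as a dict: name -> set of adjacent names."""
    graph = defaultdict(set)
    for a, b in adjacent_pairs:
        # An adjacency relation is symmetric, so the edge is stored both ways.
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Hypothetical adjacency information forming a single cycle A - B - C - A.
pairs = [("A", "B"), ("B", "C"), ("C", "A")]
g = build_graph(pairs)
```

Storing each vertex with its set of neighbours keeps both the vertex set (the dictionary keys) and the edge set (the key and neighbour pairs) available to later steps.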
The processor 101 of the name data association device also operates as the common data extraction unit 4. That is, the processor 101 extracts the name data that are common to the information of the basic database 1 stored in the basic database storage unit 1031 and the information of the derived database 2 stored in the derived database storage unit 1032 (step S2). The extracted common name data are stored in the temporary storage unit 1033 of the data memory 103.
Next, the processor 101 operates as the path information extraction unit 5. That is, based on the common name data and the path identification information stored in the derived database 2, the processor 101 extracts the paths $\Gamma_k$ ($k \in \{1, 2, \ldots, K\}$, where $K$ is the total number of paths in the cycle graph $G_c$) from the cycle graph $G_c$ of the derived database 2 (step S3). The path information describing each extracted path $\Gamma_k$ is stored in the temporary storage unit 1033 of the data memory 103. The path information can include the number of vertices of the path $\Gamma_k$, the name data of the vertices it contains, and their positions on the path.
Here, the path $\Gamma_k$ is the $k$-th path in the cycle graph $G_c$ of the derived database 2, starting from a vertex $s_k \in V_c$.
   $\Gamma_k[l]$: the $l$-th vertex ($l$-th element) among the vertices constituting $\Gamma_k$
   $|\Gamma_k|$: the length of the path $\Gamma_k$ (the number of vertices constituting $\Gamma_k$)
      $\Gamma_k = (s_k, \ldots, t_k)$,
      $(\Gamma_k[l], \Gamma_k[l+1]) \in E_c$,
      $l \in \{1, 2, \ldots, |\Gamma_k|\}$
The cycle graph $G_c$ may contain any number of paths, but every path is assumed to satisfy the following three conditions.
 1. For every path, there exist $d_j, d_l \in V_d$ such that $s_k = d_j$ and $t_k = d_l$.
 2. Every edge constituting the path $\Gamma_k$ exists in $E_c$.
 3. Every $c_i \in V_c$ belongs to some path $\Gamma_k$.
Let $S := \{c_i \in V_c \mid \exists d_j \in V_d \text{ s.t. } c_i = d_j\}$ be the set, extracted in step S2, of building names that are written identically in $V_c$ and $V_d$. For each $c_i$ and $d_j$ that is not an element of this set $S$, the name data association device performs the association using the cycle graphs $G_c$ and $G_d$ as follows.
Here, one of the paths constituting the cycle graph $G_c$ is denoted by $\Gamma_k$, and the sequence of those vertices of $\Gamma_k$ that are contained in the set $S$ is denoted by $I_k$ and defined as follows.
      $I_k := (\Gamma_k[i] \mid \Gamma_k[i] \in S,\ \Gamma_k[i] \neq s_k,\ i = 1, 2, \ldots, |\Gamma_k|)$
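The set $S$ and the sequence $I_k$ can be sketched in a few lines. The helper names `common_names` and `common_sequence` are hypothetical, and the building names are borrowed from the operation example given later in this description.

```python
def common_names(v_c, v_d):
    """Set S: names written identically in both databases."""
    return set(v_c) & set(v_d)


def common_sequence(path, common):
    """Sequence I_k: vertices of the path that belong to S, in path order,
    excluding the start vertex s_k as in the definition above."""
    start = path[0]
    return [v for v in path if v in common and v != start]


# Vertex names of the basic database (V_d) in the operation example.
v_d = ["福岡花園ビル", "立子山ビル", "福山伊達ビル", "桑原ビル", "福井藤田ビル",
       "福地梁川ビル", "保科ビル", "恐山ビル", "月舘ビル", "福島川俣ビル"]
# Closed path of the derived database (first and last vertex coincide).
gamma_1 = ["花園ビル", "伊達ビル", "桑原ビル", "藤田ビル", "梁川ビル",
           "保科ビル", "恐山ビル", "月館ビル", "川俣ビル", "花園ビル"]

s = common_names(gamma_1, v_d)
i_1 = common_sequence(gamma_1, s)
```

Note that 月館ビル and 月舘ビル differ by one character, so a plain set intersection does not treat them as common names.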
Next, the processor 101 operates as the path creation unit 6.
That is, based on the path information extracted in step S3, the processor 101 first extracts, for one path $\Gamma_k$, the partial paths $L^k_i$ of the cycle graph $G_c$ whose endpoints are elements of the set $S$ (step S4).
[Equation JPOXMLDOC01-appb-M000003]
Here, $L^k_i$ is the partial path of $\Gamma_k$ from the vertex $I_k[i]$ to the vertex $I_k[i+1]$, and $I_k[i]$ is the $i$-th element of the sequence $I_k$.
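A minimal sketch of step S4 follows. It assumes the path is closed (first and last vertex identical), so that the last partial path wraps around the cycle, as the third partial path of the operation example does. The function name `partial_paths` is hypothetical.

```python
def partial_paths(closed_path, common):
    """Split a closed path into segments whose endpoints are common names.

    Assumes closed_path[0] == closed_path[-1]; the final segment wraps
    around the cycle back to the first common name.
    """
    if closed_path[0] != closed_path[-1]:
        raise ValueError("this sketch only handles closed paths")
    cycle = closed_path[:-1]  # drop the repeated start vertex
    n = len(cycle)
    idx = [i for i, v in enumerate(cycle) if v in common]
    segments = []
    for a, b in zip(idx, idx[1:] + idx[:1]):
        steps = (b - a) % n or n  # a lone common vertex yields the whole cycle
        segments.append([cycle[(a + s) % n] for s in range(steps + 1)])
    return segments


gamma_1 = ["花園ビル", "伊達ビル", "桑原ビル", "藤田ビル", "梁川ビル",
           "保科ビル", "恐山ビル", "月館ビル", "川俣ビル", "花園ビル"]
common = {"桑原ビル", "保科ビル", "恐山ビル"}
segments = partial_paths(gamma_1, common)
```

For the path of the operation example, this yields the three partial paths listed there, including the one that runs from 恐山ビル through 花園ビル back to 桑原ビル.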
Next, based on the extracted partial paths, the processor 101 enumerates all paths in the cycle graph $G_d$ of the basic database 1 that start at $I_k[i]$, end at $I_k[i+1]$, and have a length of at least $|L^k_i|$ and at most $|L^k_i| + x$ ($i = 1, 2, \ldots, |\Gamma_k|$) (step S5). Here, $x$ is a non-negative integer specified by the user. When enumerating these paths, the same vertex or edge is never traversed twice. The set of enumerated paths is denoted by $A^k_i$.
[Equation JPOXMLDOC01-appb-M000004]
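The enumeration of step S5 can be sketched as a depth-first search over simple paths. Several points here are assumptions rather than statements from the text: the name `enumerate_paths`, the convention that length counts edges (as in the operation example), the requirement that start and goal differ, and the `forbidden` parameter, which keeps other common names out of the interior of a path. The last assumption is adopted because it reproduces the path counts reported in the operation example; the adjacency pairs are those of that example.

```python
def enumerate_paths(graph, start, goal, min_len, max_len, forbidden=frozenset()):
    """All simple paths start -> goal whose edge count lies in [min_len, max_len].

    Assumes start != goal. Vertices in `forbidden` may appear only as endpoints.
    """
    found = []

    def dfs(path):
        edges = len(path) - 1
        if path[-1] == goal:
            if edges >= min_len:
                found.append(list(path))
            return
        if edges == max_len:
            return  # any extension would exceed the length bound
        for nxt in sorted(graph[path[-1]]):
            if nxt in path:
                continue  # never visit the same vertex twice
            if nxt != goal and nxt in forbidden:
                continue
            path.append(nxt)
            dfs(path)
            path.pop()

    dfs([start])
    return found


pairs = [("福岡花園ビル", "立子山ビル"), ("立子山ビル", "福山伊達ビル"),
         ("福山伊達ビル", "桑原ビル"), ("桑原ビル", "福井藤田ビル"),
         ("福井藤田ビル", "福地梁川ビル"), ("福地梁川ビル", "保科ビル"),
         ("保科ビル", "恐山ビル"), ("保科ビル", "福岡花園ビル"),
         ("恐山ビル", "福岡花園ビル"), ("恐山ビル", "月舘ビル"),
         ("月舘ビル", "福島川俣ビル"), ("福島川俣ビル", "福岡花園ビル")]
g_d = {}
for a, b in pairs:
    g_d.setdefault(a, set()).add(b)
    g_d.setdefault(b, set()).add(a)

common = {"桑原ビル", "保科ビル", "恐山ビル"}
x = 1
paths_1 = enumerate_paths(g_d, "桑原ビル", "保科ビル", 3, 3 + x, common)
paths_3 = enumerate_paths(g_d, "恐山ビル", "桑原ビル", 5, 5 + x, common)
```

Bounding the search depth by `max_len` is what lets the user-specified `x` limit the processing time.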
Next, the processor 101 operates as the association unit 7.
That is, if the set $A^k_i$ of paths enumerated in step S5 contains a path of length $|L^k_i|$, the processor 101 first takes that path as $\alpha$ and selects the following combinations of names (step S6).
     $(L^k_i[j], \alpha[j]),\ j = 1, 2, \ldots, |L^k_i|$
Here, $L^k_i[j]$ and $\alpha[j]$ are the $j$-th vertices of the respective paths.
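Step S6 amounts to pairing vertices by position. The following sketch uses the hypothetical name `pair_by_position`, writes each pair as (basic name, derived name) in the order used by the operation example, and, as a simplification of the formula above, skips the two endpoints because they are common names and therefore already matched.

```python
def pair_by_position(partial_path, alpha):
    """Pair the j-th vertex of the partial path with the j-th vertex of alpha."""
    if len(partial_path) != len(alpha):
        raise ValueError("both paths must have the same number of vertices")
    # Interior vertices only; each pair is (basic name, derived name).
    return [(d, c) for c, d in zip(partial_path[1:-1], alpha[1:-1])]


l_1_1 = ["桑原ビル", "藤田ビル", "梁川ビル", "保科ビル"]          # from G_c
alpha = ["桑原ビル", "福井藤田ビル", "福地梁川ビル", "保科ビル"]  # from G_d
pairs = pair_by_position(l_1_1, alpha)
```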
Furthermore, if the enumerated paths include one whose length is greater than $|L^k_i|$ and at most $|L^k_i| + x$, the processor 101 searches for and associates combinations of name data using a name search technique based on string similarity (for example, edit distance), as shown in FIG. 6, and stores the result in the temporary storage unit 1033 of the data memory 103 (step S7). Edit distance is described, for example, in D. Gusfield, "Algorithms on strings, trees and sequences: computer science and computational biology," Cambridge University Press, 1997.
FIG. 6 is a schematic diagram for explaining the name association method. Suppose that the basic database 1 stores the name data of buildings $BL_d$ (A Building, B Building, ..., n Building), that the derived database 2 stores the names of buildings $BL_c$ (α Building, β Building, ..., ν Building), and that, as indicated by the dash-dot lines in the figure, A Building and α Building, and n Building and ν Building, have already been associated with each other, either because they have the same name or through another path. In such a case, the processor 101 can search for combinations of name data by the following procedure.
 1. Initialize $x = 0$.
 2. Enumerate the paths of length $|L^k_i| + x$.
 3. From the vertices of one enumerated path, remove the names that have already been associated.
 4. If the length of the obtained path is greater than $|L^k_i|$, then for each building $BL_c$ of the derived database 2 that has not yet been associated (γ Building), find the building $BL_d$ of the basic database 1 with the shortest edit distance. For example, C Building is found and associated as the building with the shortest edit distance from γ Building, as indicated by the solid arrow.
 5. Set $x = x + 1$ and repeat steps 2 to 4 until $x$ reaches the upper limit specified in advance by the user.
When searching for the building $BL_d$ with the shortest edit distance, the search starts from the building following the most recently associated building $BL_d$, as indicated by the dashed arrows.
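Steps 3 and 4 above can be sketched as follows for a single enumerated path. The edit distance is the standard Levenshtein distance computed by dynamic programming. The function name `match_in_order` and the tie-breaking rule (the earliest candidate wins) are assumptions. The search for each name starts just after the previously matched position, following the dashed arrows of FIG. 6. The data are the interior vertices of the length-6 path and of the third partial path in the operation example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def match_in_order(derived_inner, basic_inner):
    """Greedily pair each unmatched derived name with the closest basic name,
    searching only after the previously matched position."""
    pairs, start = [], 0
    for name in derived_inner:
        if start >= len(basic_inner):
            break  # no candidates left on this path
        best = min(range(start, len(basic_inner)),
                   key=lambda i: edit_distance(name, basic_inner[i]))
        pairs.append((basic_inner[best], name))
        start = best + 1
    return pairs


derived_inner = ["月館ビル", "川俣ビル", "花園ビル", "伊達ビル"]
basic_inner = ["月舘ビル", "福島川俣ビル", "福岡花園ビル", "立子山ビル", "福山伊達ビル"]
pairs = match_in_order(derived_inner, basic_inner)
```

Restricting the search to later positions preserves the order of buildings along the path, which is what allows 立子山ビル, a building with no counterpart in the derived path, to be skipped.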
In this way, a name for which only one combination exists is output as is. For the other names, name data for which an output result has already been obtained are excluded from the candidates, and among the remaining candidates only those that are consistent are kept as the association result. Consistency here means the following. Suppose that a name A has several candidate names and that one of them, name B, has been excluded by the above operation. There is always a path P that was the basis for outputting the combination (A, B) of name A and the excluded name B. This path P also yields a combination (C, D) for a name C other than name A. Because the combination (A, B) has been excluded, the combination (C, D) is excluded as well. A more specific example is given later as an operation example.
When the processing of one path $\Gamma_k$ has been completed in this way, the processor 101 determines whether all of the paths $\Gamma_k$ based on the path information extracted in step S3 have been processed (step S8), that is, whether processing has finished for all vertices of all paths $\Gamma_k$. If it determines that an unprocessed path $\Gamma_k$ remains, the processor 101 updates $k$, returns to step S4, and repeats steps S4 to S7.
If it determines in step S8 that all of the paths $\Gamma_k$ have been processed, the processor 101 operates as the data output unit 8 and outputs the name data association information (step S9). That is, the processor 101 generates output information, in the form specified from the input unit 107 or by the external data processing device, from the association results stored in the temporary storage unit 1033 of the data memory 103, and stores the generated output information in the output information storage unit 1034 of the data memory 103. The processor 101 can then display this output information on the display unit 108 via the input/output interface 105, or transmit it to an external data processing device via the communication interface 104.
In the name data association device according to the embodiment described above, the path creation unit 6 extracts partial paths that have common data as endpoints and non-common data as the vertices between the endpoints and, for each partial path, creates paths that have the same common data as endpoints and a length equal to or greater than that of the partial path. For each partial path, the association unit 7 then searches for combinations of the vertices on the partial path and the vertices on the created paths, thereby associating the name data held by the basic database 1 with the name data held by the derived database 2. As a result, synonymous name data whose notation varies between the databases to be integrated can be associated accurately and without manual work, even when the other string data attached to the name data have no correspondence between the databases. It therefore becomes possible to collect information on a given matter across different databases without omission. The reduction in manual work can also be expected to improve operational efficiency.
In the name data association device according to the embodiment, the graph creation unit 3 creates the cycle graphs $G_d$ and $G_c$, which are undirected graphs of the basic database 1 and the derived database 2 with the name data as vertices, and the path information extraction unit 5 generates all paths $\Gamma_k$ that have common data as endpoints and the name data held by the derived database 2 as vertices, and extracts, for each path $\Gamma_k$, path information including the number of vertices, the name data of the vertices it contains, and their positions on the path. For one of the paths $\Gamma_k$, the path creation unit 6 then extracts partial paths from the cycle graph $G_c$ based on the path information and, for each partial path, creates from the cycle graph $G_d$ paths that have the same common data as endpoints and contain at least as many vertices as the partial path. It is therefore possible to create paths that contain those name data of the basic database 1 that may correspond to name data of the derived database 2, in other words, paths from which vertices that cannot correspond to any name data of the derived database 2 have been excluded.
Furthermore, the path creation unit 6 creates only paths whose number of vertices is at least the number of vertices of the path $\Gamma_k$ and exceeds it by no more than the number specified by the user. Limiting the number of vertices in a path in this way shortens the processing time.
In the name data association device according to the embodiment, for each vertex on a path created by the path creation unit 6, if its position on the path corresponds to a vertex on the partial path, the association unit 7 associates the name data of the basic database 1 at that vertex with the name data of the derived database 2 at the corresponding vertex of the partial path. If its position on the path does not correspond to a vertex on the partial path, the association unit 7 associates the name data of the basic database 1 on the path with the name data of the derived database 2 on the partial path based on the string similarity between the name data. The name data held by the basic database 1 can therefore be easily associated with the name data held by the derived database 2.
The name data association device according to the embodiment also repeats the processing of the path creation unit 6 and the association unit 7 until all of the paths $\Gamma_k$ generated by the path information extraction unit 5 have been processed. This reduces the probability that name data held by the derived database 2 fail to be associated with name data held by the basic database 1.
In addition, in the name data association device according to the embodiment, the data output unit 8 generates output information including a name data correspondence table based on the result of the name data association. This output information can then be used to carry out database integration. The name data association device according to the embodiment may also generate the information of the integrated database as the output information.
[Operation example]
As an operation example of the present embodiment, an overview of the name data to which the method was applied and the results are described below.
FIG. 7 is a diagram showing an example of the information held by the basic database 1 stored in the basic database storage unit 1031 in the operation example. The adjacency information obtained from this basic database is as follows. Here, the notation (A, B) indicates that data name A and data name B are connected. The building names are kept in their original notation because the example depends on their exact spelling (ビル means "Building").
   ・(福岡花園ビル [Fukuoka Hanazono], 立子山ビル [Tatsugoyama])
   ・(立子山ビル, 福山伊達ビル [Fukuyama Date])
   ・(福山伊達ビル, 桑原ビル [Kuwabara])
   ・(桑原ビル, 福井藤田ビル [Fukui Fujita])
   ・(福井藤田ビル, 福地梁川ビル [Fukuchi Yanagawa])
   ・(福地梁川ビル, 保科ビル [Hoshina])
   ・(保科ビル, 恐山ビル [Osorezan])
   ・(保科ビル, 福岡花園ビル)
   ・(恐山ビル, 福岡花園ビル)
   ・(恐山ビル, 月舘ビル [Tsukidate])
   ・(月舘ビル, 福島川俣ビル [Fukushima Kawamata])
   ・(福島川俣ビル, 福岡花園ビル)
FIG. 8 is a diagram showing an example of the information held by the derived database 2 stored in the derived database storage unit 1032 in the operation example. The adjacency information obtained from this derived database is as follows. The path formed by this adjacency information is $\Gamma_k$, and since only one path is handled in this operation example, $k = 1$. Here, the notation (A→B) indicates that there is a path from data name A to data name B.
   ・(花園ビル [Hanazono] → 伊達ビル [Date])
   ・(伊達ビル → 桑原ビル [Kuwabara])
   ・(桑原ビル → 藤田ビル [Fujita])
   ・(藤田ビル → 梁川ビル [Yanagawa])
   ・(梁川ビル → 保科ビル [Hoshina])
   ・(保科ビル → 恐山ビル [Osorezan])
   ・(恐山ビル → 月館ビル [Tsukidate, written with the character 館 instead of 舘])
   ・(月館ビル → 川俣ビル [Kawamata])
   ・(川俣ビル → 花園ビル)
In this operation example, for the name data with path ID = 2, the vertex sets $V_d$ (basic database 1) and $V_c$ (derived database 2) are as follows, in keeping with the definitions of $d_j$ and $c_i$ given above.
      $V_d$ = {福岡花園ビル, 立子山ビル, 福山伊達ビル, 桑原ビル, 福井藤田ビル, 福地梁川ビル, 保科ビル, 恐山ビル, 月舘ビル, 福島川俣ビル}
      $V_c$ = {花園ビル, 伊達ビル, 桑原ビル, 藤田ビル, 梁川ビル, 保科ビル, 恐山ビル, 月館ビル, 川俣ビル}
The correct combinations of notations, that is, the correct association of the name data, in this operation example are as follows.
      {(月舘ビル, 月館ビル), (福島川俣ビル, 川俣ビル), (福岡花園ビル, 花園ビル), (福山伊達ビル, 伊達ビル), (福井藤田ビル, 藤田ビル), (福地梁川ビル, 梁川ビル)}
It was checked whether the name data association device according to the embodiment can perform this association correctly.
In step S1, the processor 101 of the name data association device operates as the graph creation unit 3 and creates the cycle graphs. FIG. 9 is a schematic diagram showing an example of the cycle graph $G_d$ created from the information held by the basic database 1 in the operation example.
In step S2, the processor 101 operates as the common data extraction unit 4 and extracts the name data common to the cycle graph $G_c$ and the cycle graph $G_d$. The name data written identically in both, that is, the building name set $S$, are as follows.
      $S$ := {桑原ビル, 保科ビル, 恐山ビル}
The processor 101 then operates as the path information extraction unit 5 in step S3 to extract the path information from the derived database 2, and operates as the path creation unit 6 in step S4 to extract the partial paths. FIG. 10 is a schematic diagram showing an example of the path $\Gamma_1$ generated from the cycle graph $G_c$ created from the information held by the derived database 2 in the operation example. From the cycle graph $G_c$, the processor 101 extracts the following partial paths, whose endpoints are elements of the building name set $S$.
[Equation JPOXMLDOC01-appb-M000005]
      $L^1_1$ := (桑原ビル, 藤田ビル, 梁川ビル, 保科ビル)
      $L^1_2$ := (保科ビル, 恐山ビル)
      $L^1_3$ := (恐山ビル, 月館ビル, 川俣ビル, 花園ビル, 伊達ビル, 桑原ビル)
Then, in step S5, for the partial path $L^1_1$, the processor 101 enumerates the paths on the cycle graph $G_d$ that have 桑原ビル and 保科ビル as endpoints and a length of at least 3 and at most $3 + x$.
Since this is an operation example, the parameter is set to $x = 1$. The enumeration then yields:
      Length 3: (桑原ビル, 福井藤田ビル, 福地梁川ビル, 保科ビル)
      Length 4: (桑原ビル, 福山伊達ビル, 立子山ビル, 福岡花園ビル, 保科ビル)
For the path of length 3, combining the corresponding vertex names of the enumerated path and the partial path $L^1_1$ gives the candidates
      (福井藤田ビル, 藤田ビル), (福地梁川ビル, 梁川ビル).
For the path of length 4, every combination has an edit distance of 1, so the following candidates are conceivable:
      for 藤田ビル, the candidates 福山伊達ビル, 立子山ビル, and 福井藤田ビル;
      for 梁川ビル, the candidates 福岡花園ビル, 立子山ビル, and 福地梁川ビル.
The partial path $L^1_2$ has length 1 and is therefore omitted.
For the partial path $L^1_3$, the processor 101 enumerates the paths on the cycle graph $G_d$ that have 恐山ビル and 桑原ビル as endpoints and a length of at least 5 and at most $5 + x = 6$. This yields:
      Length 5: none
      Length 6: (恐山ビル, 月舘ビル, 福島川俣ビル, 福岡花園ビル, 立子山ビル, 福山伊達ビル, 桑原ビル)
From the length-6 path, selecting for each vertex of partial path L1 3 the vertex with the shortest edit distance yields the candidates
(Tsukidate Building, Tsukikan Building), (Fukushima Kawamata Building, Kawamata Building), (Fukuoka Hanazono Building, Hanazono Building), (Fukuyama Date Building, Date Building).
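The selection by shortest edit distance can be sketched with a plain Levenshtein distance computed on the original Japanese names. Two points are assumptions rather than statements from the text: the interior order of partial path L1 3 is inferred from the pairs listed above, and each vertex is compared only against an order-preserving window of positions, which avoids ties that plain Levenshtein distance would otherwise produce (for instance, 川俣ビル is at distance 2 from both 月舘ビル and 福島川俣ビル).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def closest_in_window(candidate_path, partial_path):
    """For each interior vertex of the partial path, pick the nearest name in its window."""
    inner_candidate = candidate_path[1:-1]
    inner_partial = partial_path[1:-1]
    slack = len(inner_candidate) - len(inner_partial)
    result = []
    for i, name in enumerate(inner_partial):
        window = inner_candidate[i:i + slack + 1]
        best = min(window, key=lambda option: levenshtein(option, name))
        result.append((best, name))
    return result


path_len6 = ["恐山ビル", "月舘ビル", "福島川俣ビル", "福岡花園ビル",
             "立子山ビル", "福山伊達ビル", "桑原ビル"]
partial_l1_3 = ["恐山ビル", "月館ビル", "川俣ビル", "花園ビル", "伊達ビル", "桑原ビル"]

matches = closest_in_window(path_len6, partial_l1_3)
for gd_name, partial_name in matches:
    print(gd_name, partial_name, levenshtein(gd_name, partial_name))
```

Note that 月舘ビル and 月館ビル differ by a single character in the original notation, which is exactly the kind of spelling variation the association targets.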
Since a combination with exactly one candidate is taken as the answer,
(Tsukidate Building, Tsukikan Building), (Fukushima Kawamata Building, Kawamata Building), (Fukuoka Hanazono Building, Hanazono Building), (Fukuyama Date Building, Date Building)
are the answers.
Given the answers for "Hanazono Building" and "Date Building", the candidates for "Fujita Building" and "Yanagawa Building" are reduced to:
Candidates for "Fujita Building": "Ritsukoyama Building", "Fukui Fujita Building"
Candidates for "Yanagawa Building": "Ritsukoyama Building", "Fukuchi Yanagawa Building"
Because "Fukuoka Hanazono Building" and "Fukuyama Date Building" are no longer candidates, the
Path: (Kuwabara Building, Fukuyama Date Building, Ritsukoyama Building, Fukuoka Hanazono Building, Hoshina Building)
cannot be the path in the cycle graph Gd that corresponds to partial path L1 1. "Ritsukoyama Building" is therefore also excluded from the candidates for "Fujita Building" and "Yanagawa Building", and
(Fukui Fujita Building, Fujita Building), (Fukuchi Yanagawa Building, Yanagawa Building)
are obtained as the answers.
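The elimination step can be sketched as a simple pruning pass: names on Gd that already belong to a confirmed answer are treated as claimed, any enumerated path whose interior uses a claimed name is discarded, and the surviving path is paired by position. The helper name `prune_paths` is hypothetical, and the sketch assumes that exactly one path of matching length survives, as it does in this example.

```python
def prune_paths(candidate_paths, claimed):
    """Drop candidate paths whose interior uses a vertex already matched elsewhere."""
    return [path for path in candidate_paths
            if not claimed.intersection(path[1:-1])]


# Answers already fixed from partial path L1 3: (name on Gd, name on partial path).
confirmed = [
    ("月舘ビル", "月館ビル"),
    ("福島川俣ビル", "川俣ビル"),
    ("福岡花園ビル", "花園ビル"),
    ("福山伊達ビル", "伊達ビル"),
]
claimed = {gd_name for gd_name, _ in confirmed}

partial_l1_1 = ["桑原ビル", "藤田ビル", "梁川ビル", "保科ビル"]
candidate_paths = [
    ["桑原ビル", "福井藤田ビル", "福地梁川ビル", "保科ビル"],
    ["桑原ビル", "福山伊達ビル", "立子山ビル", "福岡花園ビル", "保科ビル"],
]

remaining = prune_paths(candidate_paths, claimed)

answers = []
if len(remaining) == 1 and len(remaining[0]) == len(partial_l1_1):
    # A single surviving path of equal length can be paired vertex by vertex.
    answers = list(zip(remaining[0][1:-1], partial_l1_1[1:-1]))
print(answers)
```

Discarding the length-4 path also removes "立子山ビル" (Ritsukoyama Building) from consideration, since it was a candidate only through that path.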
After that, the processor 101 generates output information based on the association results stored in the temporary storage unit 1033 of the data memory 103 and stores it in the output information storage unit 1034 of the data memory 103. FIG. 11 is a diagram showing an example of the output information stored in the output information storage unit 1034. Although the output information is shown here as a correspondence table representing the correspondence between name data, it is of course not limited to this form.
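As a rough illustration of such a correspondence table, the association results from the worked example could be rendered as CSV text. The column names and the CSV format are assumptions made for this sketch; the layout of FIG. 11 is not reproduced here.

```python
import csv
import io


def correspondence_table(pairs):
    """Render (name on Gd, name on partial path) pairs as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name_on_gd", "name_on_partial_path"])  # hypothetical column names
    writer.writerows(pairs)
    return buffer.getvalue()


answers = [
    ("福井藤田ビル", "藤田ビル"),
    ("福地梁川ビル", "梁川ビル"),
    ("月舘ビル", "月館ビル"),
    ("福島川俣ビル", "川俣ビル"),
    ("福岡花園ビル", "花園ビル"),
    ("福山伊達ビル", "伊達ビル"),
]

table = correspondence_table(answers)
print(table)
```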
From the above, it was verified that the name data association device can associate name data accurately by using partial paths.
[Other embodiments]
In the embodiment described above, the case of two target databases was described as an example, but there may be three or more. That is, if at least one of the three or more databases holds path identification information, name data can be associated between it and the remaining two or more databases.
The embodiment was described using a path as an example, but a closed path (cycle), in which the start point and the end point are the same vertex, can of course be handled as well.
In the embodiment described above, an example was described in which all or part of the information held by the basic database 1 and the derived database 2 is stored in the basic database storage unit 1031 and the derived database storage unit 1032 of the data memory 103 before processing proceeds, but the invention is not limited to this. The processor 101 may access an external data server through the communication interface 104 as needed, proceed with processing using the information accumulated in the basic database 1 and the derived database 2 built there, and store only the processing result of each step in the temporary storage unit 1033. This reduces the required capacity of the data memory 103 in the name data association device, so the device can be configured at low cost.
In the embodiment described above, an example was described in which output information is generated and output to the display unit 108 or an external data processing device, but the association results stored in the temporary storage unit 1033 may instead be output without generating output information. This also reduces the required capacity of the data memory 103 in the name data association device, so the device can be configured at low cost. It further makes it possible to provide, to a data processing device that performs database integration, a service that performs only the association of name data.
The methods described in each embodiment can also be stored, as a program (software means) executable by a computer, on a recording medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disc (CD-ROM, DVD, MO, etc.), or a semiconductor memory (ROM, RAM, flash memory, etc.), or can be transmitted and distributed via a communication medium. The programs stored on the medium include a setting program that configures, within the computer, the software means (including not only execution programs but also tables and data structures) to be executed by the computer. A computer that realizes this device reads the program recorded on the recording medium, constructs the software means by the setting program where applicable, and executes the processing described above with its operation controlled by this software means. The recording medium referred to in this specification is not limited to media for distribution and includes storage media such as magnetic disks and semiconductor memories provided inside the computer or in devices connected via a network.
In short, the present invention is not limited to the above embodiments and can be modified in various ways at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate wherever possible, in which case the combined effects are obtained. Furthermore, the above embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements.
REFERENCE SIGNS LIST
 1…basic database
 2…derived database
 3…graph creation unit
 4…common data extraction unit
 5…path information extraction unit
 6…path creation unit
 7…association unit
 8…data output unit
 101…processor
 102…program memory
 103…data memory
 104…communication interface
 105…input/output interface
 106…bus
 107…input unit
 108…display unit
 1031…basic database storage unit
 1032…derived database storage unit
 1033…temporary storage unit
 1034…output information storage unit

Claims (8)

  1. A name data association device that associates synonymous name data having different notations between a first database, which holds a plurality of name data and adjacency information indicating logical or physical adjacency relationships among the name data, and a second database, which holds a plurality of name data, adjacency information of the name data, and path identification information representing the paths to which the name data belong, the device comprising:
    a common data extraction unit that extracts, as common data, name data having the same notation in the first database and the second database;
    a path creation unit that extracts, from the path represented by the path identification information held by the second database, partial paths each having the common data extracted by the common data extraction unit as endpoints and non-common data as vertices between the endpoints, and that creates, for each of the partial paths and based on the information held by the first database, a path having the same common data endpoints as the partial path and a length equal to or greater than the length of the partial path; and
    an association unit that, for each of the partial paths extracted by the path creation unit, searches for combinations of each vertex on the partial path and the vertices on the path created by the path creation unit, thereby associating the name data held by the first database with the name data held by the second database.
  2. The name data association device according to claim 1, further comprising:
    a graph creation unit that creates, based on the information held by the first database and the second database, undirected graphs of the first database and the second database with the name data as vertices; and
    a path information extraction unit that generates, based on the undirected graph of the second database created by the graph creation unit and the path identification information held by the second database, all paths having the common data extracted by the common data extraction unit as endpoints and the name data held by the second database as vertices, and that extracts, for each of those paths, path information including the number of vertices, the name data of the included vertices, and their positions on the path,
    wherein the path creation unit, for one of the paths generated by the path information extraction unit, extracts the partial paths from the undirected graph of the second database created by the graph creation unit based on the path information, and creates, for each of the partial paths, from the undirected graph of the first database, a path that has the same common data endpoints as the partial path and includes at least as many vertices as the partial path has.
  3. The name data association device according to claim 2, wherein the path creation unit creates, as the path, a path whose number of vertices is equal to or greater than the number of vertices and equal to or less than a number specified by a user relative to the number of vertices.
  4. The name data association device according to any one of claims 1 to 3, wherein, for each of the vertices on the path created by the path creation unit, the association unit:
    when the position on the path corresponds to the vertex on the partial path, associates the name data corresponding to the vertex on the path, among the name data held by the first database, with the name data of the vertex on the partial path, among the name data held by the second database; and
    when the position on the path does not correspond to the vertex on the partial path, associates, based on the character string similarity between name data, the name data corresponding to the vertex on the path, among the name data held by the first database, with the name data of the vertex on the partial path, among the name data held by the second database.
  5. The name data association device according to claim 2 or 3, wherein the path creation unit and the association unit repeat their processing until all of the paths generated by the path information extraction unit have been processed.
  6. The name data association device according to any one of claims 1 to 5, further comprising an output unit that generates output information including a name data correspondence table based on the result of the association by the association unit.
  7. A name data association method in a name data association device that comprises a processor and a memory storing a first database, which holds a plurality of name data and adjacency information indicating logical or physical adjacency relationships among the name data, and a second database, which holds a plurality of name data, adjacency information of the name data, and path identification information representing the paths to which the name data belong, and that associates synonymous name data having different notations between the first database and the second database, the method comprising:
    extracting, by the processor, as common data, name data having the same notation in the first database and the second database stored in the memory;
    extracting, by the processor, from the path represented by the path identification information held by the second database stored in the memory, partial paths each having the extracted common data as endpoints and non-common data as vertices between the endpoints;
    creating, by the processor, for each of the extracted partial paths and based on the information held by the first database stored in the memory, a path having the same common data endpoints as the partial path and a length equal to or greater than the length of the partial path; and
    associating, by the processor, the name data held by the first database stored in the memory with the name data held by the second database stored in the memory by searching, for each of the extracted partial paths, for combinations of each vertex on the partial path and the vertices on the created path.
  8. A name data association program that causes a processor to function as each of the units of the name data association device according to any one of claims 1 to 6.
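The first two steps recited in claim 1, extracting common data and cutting a path into partial paths at the common vertices, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the sample path is assembled from names in the worked example in an assumed order, so it is not the actual path held by either database.

```python
def extract_common(names_first, names_second):
    """Names written identically in both databases are treated as common data."""
    return set(names_first) & set(names_second)


def split_into_partial_paths(path, common):
    """Cut a path at every common vertex; each piece starts and ends on common data."""
    partials = []
    start = None
    for index, name in enumerate(path):
        if name not in common:
            continue
        if start is not None:
            partials.append(path[start:index + 1])
        start = index
    return partials


# Names on the graph used for path creation (assumed from the worked example).
names_on_graph = ["桑原ビル", "福井藤田ビル", "福地梁川ビル", "保科ビル", "福山伊達ビル",
                  "立子山ビル", "福岡花園ビル", "恐山ビル", "月舘ビル", "福島川俣ビル"]
# Hypothetical path carrying path identification information.
path_with_id = ["恐山ビル", "月館ビル", "川俣ビル", "花園ビル", "伊達ビル",
                "桑原ビル", "藤田ビル", "梁川ビル", "保科ビル"]

common = extract_common(names_on_graph, path_with_id)
partials = split_into_partial_paths(path_with_id, common)
# Partial paths without interior vertices (length 1) need no association.
partials_to_match = [p for p in partials if len(p) > 2]
print(sorted(common))
print(partials_to_match)
```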

PCT/JP2021/021548 2021-06-07 2021-06-07 Name data association device, name data association method, and name data association program WO2022259303A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/021548 WO2022259303A1 (en) 2021-06-07 2021-06-07 Name data association device, name data association method, and name data association program
JP2023527147A JPWO2022259303A1 (en) 2021-06-07 2021-06-07

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/021548 WO2022259303A1 (en) 2021-06-07 2021-06-07 Name data association device, name data association method, and name data association program

Publications (1)

Publication Number Publication Date
WO2022259303A1 true WO2022259303A1 (en) 2022-12-15

Family

ID=84424985

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/021548 WO2022259303A1 (en) 2021-06-07 2021-06-07 Name data association device, name data association method, and name data association program

Country Status (2)

Country Link
JP (1) JPWO2022259303A1 (en)
WO (1) WO2022259303A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005011049A (en) * 2003-06-19 2005-01-13 Nec Soft Ltd Database integration device
JP2017123062A (en) * 2016-01-07 2017-07-13 富士通株式会社 Relation information generation method, device, and program
JP2020064417A (en) * 2018-10-16 2020-04-23 Nttテクノクロス株式会社 Management device, management method, and program

Also Published As

Publication number Publication date
JPWO2022259303A1 (en) 2022-12-15

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21944984; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2023527147; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)