WO2017170459A1

WO2017170459A1 - Method, program, and system for automatic discovery of relationship between fields in environment where different types of data sources coexist

Info

Publication number: WO2017170459A1
Application number: PCT/JP2017/012496
Authority: WO
Inventors: ラフールアガワラ; 潔町田; 敏明中川
Original assignee: スマートインサイト株式会社
Priority date: 2016-03-31
Filing date: 2017-03-27
Publication date: 2017-10-05
Also published as: US20190317938A1; JP2017188137A; JP6964384B2

Abstract

The present invention efficiently associates fields for conducting a data analysis across the boundary of a plurality of data sources differing in design, and responds to various analyses. The present invention adds an index to the field of each of a plurality of data stores, determines the level of similarity between the fields, and generates data in graphical form (enterprise data graph) such that fields determined to have high similarity constitute a node. A feature such as a morphological analysis, etc., provided by a search engine may be used for the determination of similarity. Use of the enterprise data graph makes it possible to respond to various inquiry requests across the boundary between a plurality of data stores.

Description

Method, program, and system for automatic discovery of relationships between fields in a heterogeneous data source mixed environment

The present invention relates to a method, a program, and a system for analyzing data, and more particularly, a method, a program, and a method for automatically discovering and associating relationships between fields in a heterogeneous data source mixed environment. About the system.

Today's enterprises distribute data across several different types of data sources. For example, each department of a company (sales, service, shipping, etc.) may have its own data source. On the other hand, there is a growing need to integrate data from different data sources for reporting and analysis. Here, the data source is a general term for technologies for storing and providing data, and is typically a database, but is not limited thereto. An example of the data source will be described later. Data sources are also called data repositories, data stores, data storage, etc. in some contexts

However, in the conventional system, it is difficult to determine where the data is located and how the fields in the plurality of data sources are related to each other. For example, even for products of the same model, the product code may differ between the sales department database and the service department database. In different data sources, the field names of related data may be different. Sometimes it was meaningless to map fields between different data sources. In addition, each data source often has a different data model designed by another data designer. In addition, the data in the data source is not always clean (eg, there may be missing data, inaccurate data, or incorrect formatting). The same input items may be stored in different formats depending on the data source. Considering that the amount of data in a company changes from terabytes to petabytes, the conventional system often cannot easily determine the relationship between tables (entities) and fields in different data sources.

As a technique for solving such a problem, a technique called EII (Enterprise Information Integration) for constructing a virtual database by combining a plurality of data sources is known (for example, Patent Document 1, Non-Patent Document). 1), however, it is still difficult to associate fields between data sources with different designs, and sufficient effects have not been achieved.

Patent publication gazette JP2000-222430

A method, a program, and a system for efficiently performing association between fields for performing data analysis across a plurality of data sources having different designs are provided.

The present invention is a method for analyzing data in a plurality of data stores, the step of eliminating duplication from a set of character strings in a plurality of fields in a table in the plurality of data stores, and the elimination of the duplication A stored character string in a transposed index, a step of determining similarity between the character strings stored in the transposed index, and a plurality of fields based on the determined similarity between the character strings. Executed by a computer including a step of determining similarity and a step of generating graph-format data in which the similarity relationship is expressed by an edge using a table including fields determined to have high similarity between the plurality of fields as a node. This method is addressed by providing a method.

According to the present invention, the step of determining the similarity between the character strings stored in the transposed index further includes a step of dividing the character strings in the plurality of fields by applying morphological analysis, and the character strings This problem is addressed by providing a method as described in paragraph 0008, including the step of obtaining a cosine similarity between and applying a logistics function to the cosine similarity.

In the present invention, the step of determining the similarity between the character strings stored in the transposed index further includes a step of storing a set of character strings in the plurality of fields in a temporary table; The problem is addressed by providing the method of paragraph 0008 including the step of applying a natural join operation and the step of calculating the similarity between the tables.

In addition, the present invention further includes a step of determining similarity between the fields based on attributes of data in a plurality of fields in a table in the plurality of data stores, wherein the attributes include concentrations, individual values, and the like. The method described in Paragraph 0008, Paragraph 0009, or Paragraph 0010, which is any one or more of the number of sigma, the boundary of the histogram, the number of null values, or the number of non-null values, addresses the above problem. .

The invention of the present application is a computer-implemented method using the graph format data created by the method according to paragraph 0008, paragraph 0009, paragraph 0010, or paragraph 0011, wherein the first data store Receiving a query for a first field of a first table of the first table, identifying a second field similar to the first field based on the graph-format data, and the second field; The problem is addressed by providing a method including a step of displaying at least one of a second table including the second field and a second data store including the second table.

The invention of the present application is a computer-implemented method using the graph format data created by the method according to paragraph 0008, paragraph 0009, paragraph 0010, or paragraph 0011, wherein the first data store Displaying information about, receiving a query for a first field of a first table in the first data store, and second similar to the first field based on the graph-format data Identifying information on the first data store in response to the query, and displaying information relating to the second data store including the second field, and displaying the information relating to the first data store in response to the query. And updating the display of information about in parallel.

The invention of the present application is a method executed by a computer using the graph format data created by the method described in paragraph 0008, paragraph 0009, paragraph 0010, or paragraph 0011, and is stored in the first data store. Receiving a query for a first field of the first table of the first table, identifying a second field similar to the first field based on the graph-format data, and for the first data store Combining the query result with the query result for a second data store including a second table including the second field and displaying on the same screen. Respond to challenges.

Further, the present invention is a method executed by a computer using the transposed index created by the method described in paragraph 0008, paragraph 0009, paragraph 0010, or paragraph 0011, wherein the information about the first data store A step of displaying
Receiving a query for a first field of a first table in the first data store; identifying a second field similar to the first field based on the graphical data; Displaying information related to the second data store including the second field, displaying information related to the first data store in response to the query, and displaying information related to the second data store in parallel To address the above-mentioned problem.

Provided are a method, a program, and a system for efficiently performing association between fields for performing data analysis across a plurality of data sources having different designs.

FIG. 3 is a diagram illustrating examples of various data sources in an enterprise environment. It is a figure showing the process of enterprise data graph preparation performed by the data management analysis apparatus of the Example which concerns on this invention. It is a 1st example of the algorithm which judges the similarity between the fields of the table concerning this invention. It is a 2nd example of the algorithm which judges the similarity between the fields of the table which concerns on this invention. It is a typical expression figure of the example of an enterprise data graph concerning the invention in this application. It is a figure showing the example of a screen display of the enterprise data graph in the Example which concerns on this invention. It is a figure showing the process of the 1st Example of the query using an enterprise data graph performed by the data management analysis apparatus of the Example which concerns on this invention. It is a figure showing the process of the 2nd Example of the query using an enterprise data graph performed by the data management analyzer of the Example which concerns on this invention. It is a figure showing the process of the 3rd Example of the query using an enterprise data graph performed by the data management analysis apparatus of the Example which concerns on this invention. It is a figure showing the process of the 4th Example of the query using an enterprise data graph performed by the data management analyzer of the Example which concerns on this invention. It is a figure showing the functional outline of the Example of the data management analyzer which concerns on this invention. FIG. 1 illustrates an example of a computing environment that includes a computer device suitable for use in an embodiment according to the present invention.

Detailed Description of the Invention

Embodiments according to the present invention will be described below with reference to the drawings. Details are omitted for clarity. All figures are exemplary.

FIG. 1 shows examples of various data sources in an enterprise environment (100). Data used in the enterprise environment (100) is provided from various types of data sources (105-135). One type of data source is a search engine (105). Search engines (105) (eg, SOLR and ELASTICSEARCH) use a transposed index for high-speed search using character strings divided by tokenizers and n-grams, and are useful for storing and searching text data It is. The inverted index stores terms and keywords mapped to a plurality of documents (107). Each document may correspond to one record including one or more attribute values. Some fields of each record may be indexed for efficient search.

Another type of data source is a relational database (also referred to as a relational database management system (RDBMS)) (110). The relational database (110) stores data in tables (113), each table representing an entity in the system, and specific fields (also called attributes) representing relationships between entities. The relationship may be one to one, one to many, or many to many. RDBMS can be queried by Structured Query Language (SQL) and provides a robust and mature mechanism for storing and querying structured data. RDBMS typically uses a B-tree data structure for efficient reading from disk and inserting data into a relational database (110). A B-tree is a data structure that sorts and maintains data so that it can be searched, inserted, and deleted in logarithmic time (log N).

Another type of data source is the columner database (115). The columnar database (115) is similar to the relational database (110), but is characterized by storing data in a column (118) orientation rather than in a row oriented structure. The columner database (115) enables efficient search processing of data having a small number of columns (118) in a large number of rows. Many analytical queries require that some columns (118) be aggregated, in which case the columner database (115) is advantageous in terms of storage and retrieval efficiency. It is. This is because column-oriented storage requires fewer disk readings for reading a specific column (118). AMAZON REDSHIFT is an example of a columnar database (115), and APACHE PARQUET is an example of a column-oriented file format.

Another type of data source is a key-value database (also called key-value store) (120). The key value store (120) uses a data storage scheme designed to store, retrieve and manage the associative array (123). The associative array (123) is a data structure commonly referred to as a dictionary or hash. In the key / value store (120) (for example, RIAK, REDIS, MEMCACHE), high-speed data search is possible based on the key. The key value store (120) may be implemented as a map data structure on disk or memory. The key value store (120) can follow a consistency model based on serialization and eventual consistency. Key access to the key value store (120) can be executed with O (1) order (constant) complexity.

Another type of data source is a web service (125). The web service (125) serves as a common integration interface to various data systems inside and outside the enterprise environment (100). For example, most social media data is accessed via REST (Representational State Transfer) API (application programming interface). External cloud-based applications (160) (Salesforce.com, Google Analytics, etc.) can also be accessed via the REST API. The web service (125) provides real-time request-response access to data stored by the service provider.

Another type of data source is a file system stored on shared storage in an enterprise environment. In the file system, in addition to structured file data such as CSV / Excel, there are many unstructured data such as proposals and design documents. The storage in which the file is stored may exist in another document management system in addition to a general shared folder.

Another type of data source is a live streaming data source (130) that provides live data generated or received in real time. Live data is often provided from a real-time stream from a socket (such as Kinesis or Kafka). Stream processing can serve the dual purpose of lambda architecture (batch processing and real-time interactive processing). In the latter case, the delay requirement may be less than 1 second. In some cases, reprocessing is performed when a stream event is missing.

Another type of data source is “big data” (103). Big data data sources store large amounts of data (more than terabytes) in a clustered environment such as Hadoop or Spark. Big data typically provides a SQL-like query language. A real-time model (eg, SPARKSQL or Impala) or an asynchronous model (eg, HIVE) may be employed.

Another type of data source is the data mart (104). Data marts are often analytical cubes for online analytical processing (OLAP) or analysis services (eg, SQL Server Analysis Services (SSAS)). An analytical query (eg, SQL-based) may be issued to query the data mart (140). These analytical queries may be pre-aggregated data or cleaned data and are used to meet various reporting needs.

Another type of data source type is the business object layer (145). Business objects (eg, SAP, Infomatica) are often front-end applications that allow business users to view, categorize and analyze business intelligence data. These front-end application layers may be queried directly using specific APIs.

A plurality of types of data sources (105-145) may be communicably connected to each other via a wired connection or a wireless connection network (150). Furthermore, the data management analyzer (155) may be connected to the network (150). The data management analyzer (155) may be a computer (eg, laptop or desktop), a mobile device (eg, a smartphone, a wearable device (eg, smartwatch), and a server computer. The data management analysis device (155) may include a computing environment (900) as shown in FIG.

In order to deal with the above problem, the data management analyzer (155) can execute the process for data analysis described below. The embodiment mainly uses a relational database data source as an example, but is applicable to other types of data sources.

FIG. 2 shows an example (200) of a process executed by the data management / analysis apparatus (155) to determine the similarity of fields between different data sources and generate an enterprise data graph. Here, an enterprise data graph is a relationship between multiple data sources, a relationship between multiple data models within one or more data sources, or a relationship between entities stored in one or more data sources. Is a graph expressing In the following examples, enterprise data graphs representing relationships between entities will be described, but the same idea applies to relationships between data sources and data models. An entity refers to a real-world thing represented by a database (data source) and corresponds to a table or view (virtual table) in a relational database (in the following description, an entity and a table are synonymous). And the table includes a view). A field refers to a column in the table.

(205) A heuristic index for extracting words from the character string format field of each table in each data source of the plurality of data sources is prepared. Here, the heuristic index is used to extract meaningful words in text contained in the string format field of the table in the data source, and how often the word is in which field of which table of which data source. This index is used to point to whether or not it exists, and can be implemented using the transposed index function provided by search engines such as Lucene.

(210) Next, the similarity between the character string format fields stored in the heuristic index is determined by the similarity determination algorithm. Two types of specific algorithm examples are described below.

FIG. 3 shows a first example of the similarity determination algorithm. In this example, the relevance is determined by the similarity of the character strings included in the field. The feature is that the matching is determined in consideration of partial matching and fluctuation due to tokenization, not perfect matching. For example, even if there are differences in expressions for the same term depending on the data source (for example, “External device” and “External device”, “Co., Ltd.” and “Patent”), similar fields can be found. Has an advantage. The first example algorithm consists of the following steps: (1) The character string type field of each table is identified, and all values held for each field are acquired. (2) A DISTINCT operation is applied to the acquired value set to eliminate duplication (analysis focusing on “type” instead of “number of values”). (3) The duplicated value set is fed to a search engine that can perform morphological analysis (for example, Apache Lucene / 形態 Solr). The character string is divided into sequences by token decomposition or n-gram processing by a morphological analyzer. How to divide the character string can be set by changing the schema on the search engine side according to the user's needs. In the search engine index structure, an index containing Bag of Words is formed. One fed character string corresponds to one document. (4) Find cosine similarity between fields. (5) Although the cosine similarity is a real value between 0 and 1, a logistic function is applied because there is a non-linear relationship between the calculated value and the similarity felt by humans. It is preferable to increase the fluctuation sensitivity around 0.5 mm and loosen the sensitivity around 0.0 and 1.0. It is preferable that the parameters of the logistic function can be changed in the setting file. (6) It is preferable to set an attribute such as High / Medium / Low / None based on a predetermined threshold for the calculated similarity value between fields.

FIG. 4 shows a second example of the similarity determination algorithm. In this example, the relevance is determined based on the degree to which a character string is included in the field. This method does not absorb ambiguous fluctuations, and makes a determination with a perfect match, and includes the following steps. (1) A character string type field is identified from each data model, and all values held for each field are acquired. (2) A DISTINCT operation is applied to the acquired value set to eliminate duplication (focus on “type” instead of “number of values”). (3) The duplicated value set is held as a temporary table of 1 column × n rows. An in-memory DBMS may be used for temporary tables. (4) In the temporary table group of 1 column × n rows generated in step 3 above, a join operation (natural join) is performed. The number of rows in this result set is compared with the number of records in the original two tables. (5) As a comparison method, three methods of a Dice coefficient, a Simpson coefficient, and a Jaccard coefficient are applied, and respective similarities are obtained. The three values are combined after weighting, and a real value (similarity) having a range from 0 to 1 is calculated. (6) It is preferable to discard all temporary table groups of 1 column × n rows at the stage where the similarity can be calculated for all the combinations of tables. (7) It is preferable to set an attribute such as High / Medium / Low / None for the calculated similarity value between fields based on a predetermined threshold.

(215) Further, the attributes of each field (not limited to the character string field) of each table of each data store (for example, concentration (cardinality), number of individual values (NDV), histogram range, number of null values, non-values, etc. The number of null values) may be collected and stored.

(220) Determine the similarity of each table in the different data stores based on either or both of the generated heuristic index and the collected field attributes. The similarity determined by the heuristic index and the similarity determined by the field attribute may be weighted and averaged, and may be determined to be similar when a predetermined threshold is exceeded. When the similarity determined by the heuristic index exceeds a predetermined threshold, it may be determined that they are similar regardless of the similarity determined by the field attribute. When the similarity determined by the field attribute exceeds a predetermined threshold, it may be determined that they are similar regardless of the similarity determined by the heuristic index. If the degree of similarity determined by the heuristic index is equal to or less than a predetermined threshold, it may be determined that they are not similar regardless of the degree of similarity determined by the field attribute. When the degree of similarity determined by the field attribute is equal to or less than a predetermined threshold, it may be determined that they are not similar regardless of the degree of similarity determined by the heuristic index. It is preferable that the determination method and the predetermined threshold value can be set as parameters by the user or described as a script.

(225) The similarity between the tables is determined based on the determined similarity between the fields. It is preferable to treat a table including many fields determined to be similar as having strong similarity.

(230) An enterprise data graph is generated based on the determined similarity between the tables. FIG. 5 shows a schematic representation of an example of an enterprise data graph according to the present invention. An enterprise data graph node is a table (entity) that includes fields that belong to different data sources but are determined to be similar, and edges represent the similarity between the tables. The degree of similarity between tables may be set according to the number of similar fields and the degree of similarity. In this example, each node is a table, but similar relationships between data sources and data models may be expressed in the same way.

(235) The generated enterprise data graph is preferably displayed in graphical form on the screen to enhance understanding of the relationship between the user's data and to support queries across data sources. .

FIG. 6 shows a screen display example of the enterprise data graph (600) in the embodiment according to the present invention. As shown in the figure, the enterprise data graph (600) is an area of a start data store (605) (here, "Complaints") that is a starting point of a mapping process between data stores (data repositories). 610). This area (610) may contain information about the starting data store (605) (eg, the number of fields and links (607) to confirm the contents of the fields and tables). This starting data store (605) may be automatically selected based on the query received from the user.

In addition, the enterprise data graph (600) is an area that illustrates similarity relationships between the plurality of data stores (620-655) determined by the process of FIG. 2 to be similar to the start data store (605). (615) may also be included. In the figure, the starting data store (605) (in this example, “Complaints”) has eight other data stores: “Supplier” (620), “Blue” (625), “Sales” (630), It is determined to be similar to “Recalls” (635), “Investigations” (640), “Parts” (645), “Reviews” (650), and “BOM” (655). The technology and design (data model) of these multiple data stores are not necessarily the same, and the contents of the stored data are not always completely consistent, but the similarity between the fields mentioned above The judgment algorithm can graphically represent the relationship between tables (entities) and data stores (data repositories), and can support user data analysis work.

Further, the enterprise data graph (600) includes a table determined to have fields similar to the user selected data store (in this example, “Supplier” (620) is selected). An area (665) indicating the data store (625-660) may also be included.

Some of these data stores (625-660) have been determined to also contain similar fields in the starting data repository (605). For example, “Blue” (625), “Sales” (630), “Recalls” (635), “Investigations” (640), “Investigations” (645), and “Reviews” (650) are all areas. (615) and region (665).

However, some of the data stores shown in region (665) may not be shown in region (615). In this case, it means that these data stores do not contain fields similar to the fields in the table in the start data store (605). In addition, some of the data stores shown in region (815) may not be shown in region (665), in which case these data stores are tables in user-selected data store (620). Means that it does not contain fields similar to those in For example, “Call log” (660) is shown only in region (665) and therefore does not include a field similar to the field in the table in start data store (605). Similarly, “BOM” ′ (655) is shown only in region (615) and therefore does not include fields similar to those in the table in user selected data store (620).

The enterprise data graph (600) includes an information area (670) that provides information about the starting data store (605) (eg, the number of fields) and a link to display the fields in the starting data store (605) ( 872) and one or more of them may be included.

Further, the enterprise data graph (600) includes an information area (675) that provides information (eg, number of fields) about the user selected data store (620) and a link (677) to display the fields. May be.

Further, the enterprise data graph (600) may also include an area (680) that provides information regarding the relationship between the starting data store (605) and the user selected data store (620). This region (680) may include a list (685) of fields in the starting data store (605) that have been determined to be similar to fields in the user selected data store (620). The region (680) may also include a link (690) for displaying a field in the start data store (605) that has been determined to be similar to a field in the user selected data store (620).

In the following, various embodiments of various queries (data query requests) using either or both of the enterprise data graph and / or heuristic index generated by the process of FIG. 2 will be described.

FIG. 7 illustrates a process (700 for a first embodiment of a query using an enterprise data graph (referred to herein as a data chain) that may be performed by an embodiment of the data management and analysis apparatus (155) of the application. ).

(705) A query for a field of a specific table in the data store is received from a user.

(710) The enterprise data graph is used to identify fields in other tables similar to the table queried.

(715) Optionally, the heuristic index may be re-queried to identify similar fields to the user's query target and display to the user.

(720) When a similar field and a table including the field are identified, a confirmation message is displayed to the user, the user input is received, and the screen display is changed to a screen displaying a data store other than the data store A. To do. At this time, a plurality of tables may be displayed to allow the user to select one.

(725) The field identified at 710 or 715 is used to allow the user to query the corresponding data store and display the results.

This embodiment is effective when, for example, the shipment details and production performance data stored in different data stores are traced and analyzed using the serial number of the product inquired at the call center. At this time, there is an advantage in that it is possible to reach a data store or a table to be obtained only by partial matching even if the word does not completely match like a normal RDBMS table-to-table relationship.

FIG. 8 shows a second example of a query using an enterprise data graph (referred to as a federated query or a federal query) that can be executed by the data management analyzer (155) according to an embodiment of the present application. The process (800) is shown.

(805) Display a screen representing data of a plurality of data stores to the user, and allow the user to select one of these data stores and input a query.

(810) Receive a query for one of the data stores.

(815) The enterprise data graph is used to identify tables in other data stores that are similar to the subject table of the query received at 810 and the fields it contains.

(820) Optionally, the heuristic index may be re-queried to identify and display to the user the fields associated with the user's query.

(825) Issue queries to each data store in parallel based on the identified fields and tables.

(830)
In response to the parallel query, the display on the screen corresponding to each data store is updated. For example, if a user sends a query that narrows the query period to a certain data store, an equivalent query is issued to other data stores, and the screen display of each can be changed. User convenience can be improved.

FIG. 9 illustrates a third example of a query using an enterprise data graph (real-time data fusion, virtual integration, or composite) that can be performed by a data management analyzer (155) according to an embodiment of the present application. (Referred to as data model) process (900).

(905) Receive a query for a particular table in the data store.

(910) Identify similar fields of tables in other data stores based on the enterprise data graph.

(915) Optionally, field choices identified as similar may be displayed to the user for selection by the user.

(920) The field requested in 905 and the field identified in 910 are combined to generate a query result. This embodiment allows two or more physically different data stores to be treated like a single data store. For example, when certain sales data is stored in a separate database or table for each region, the analysis can be performed as a single table without physical integration, improving user convenience.

FIG. 10 illustrates a process (1000) of a fourth embodiment of a query using a heuristic index (referred to as an Mμgen search) that can be performed by a data management and analysis apparatus (155) according to an embodiment of the present invention. .

(1005) A character string (keyword) that is a search target input by the user is received. At this time, autocomplete (suggestion) may be performed by searching the heuristic index.

(1010) A heuristic index is searched based on the received keyword, and a table including a field including the keyword and a list including the table are generated and displayed to the user.

(1015) Receive input from a user selecting a table or field from the list.

(1020) A screen (dashboard) corresponding to the selected table or field is displayed, an inquiry request from the user is received, and the result is displayed. According to this embodiment, it is possible to exhaustively search a wide variety of data stores and find data stores that may be related to a specific keyword. For example, by inputting a car model name as a keyword and searching for related recall-related information, supply chain-related information, and production system-related information across the board, it is possible to quickly deal with complaints about that vehicle type.

In addition to the embodiments described above, enterprise data graphs and heuristic indexes can be used to accommodate various queries across multiple data stores. For example, all (or some) values contained in a particular field of the current search result can be used to search across other data stores. In the method according to the present invention, since the relationship between tables is extracted as a heuristic index and an enterprise data graph, the number of data stores does not require a large load on the system for accessing the data stores. Even in a large-scale system with a large number of data, efficient data inquiry and analysis becomes possible.

FIG. 11 shows a functional schematic diagram of an embodiment of the data management analysis apparatus (155), and shows an outline of the execution method of the processes and queries shown in FIGS. The data management analysis device (155) may include a user interface layer (1105) that receives query parameters (Params) from the user and presents the query results to the user. The user interface layer (1105) receives the query and sends it to the query engine (1110). The query engine (1110) includes a query transformer (1120), a query pipeline (1125), a cry executor (1155), a result pipeline (1160), and a result combiner (1165).

The query transformer (1120) is a generic search object for retrieving queries from the user interface layer (1105) via various form elements and passing them to the query transformer (1120) within the query engine (1110). (1115). The search object (1115) contains all metadata (data model, field, data repository, filter) related to the query retrieved from the internal metadata (1130). The role of the query transformer (1120) in the federated search is to search the enterprise data graph (1135) and identify similar fields. The query transformer (1120) transforms the search object (1115) based on the enterprise data graph (1135). A search object (1115) is then provided to the query pipeline (1125) to determine the order of queries to be executed by the query pipeline (1125) and to perform transformations such as security (1175). To (1115). Prior to executing the query, the query executor (1155) may check whether the result of the query exists in the query result cache (1150). If the query is for one data source, the query executor (1155) may execute directly.

For compound queries (eg, queries against multiple data stores (1140, 1145)), the entire query execution flow may be recorded in internal metadata (1130) along with corresponding statistics. For example, statistics such as query conversion time, physical query execution time, total network transfer transfer time, query template / format for which the workflow was executed may be recorded.

The query executor (1155) takes the search object (1115) and converts it into an executable query using the native data engine API or language. Pushdowns may be used whenever possible to improve query execution efficiency. The query executor (1155) then returns a data structure containing the query results to the result pipeline (1160).

The query executor (1155) may return multiple query results as a result pipeline (1160) that coordinates the sequence of results. Transformations based on proprietary logic may be performed at this layer.

If the combined query invokes multiple queries for different physical data sources, the result combiner (1165) may be invoked to join or combine intermediate results at each join point of the query. For example, the result combiner (1165) may accept the intermediate results, determine the best join strategy, and then return the join results for both intermediate data sets. For “big data” class joins between datasets, the result combiner (765) implements a distributed join strategy by performing calculations using a clustered in-memory engine (eg, APACHE SPARK). Can be used. Such a query may be executed in real time, and may be executed asynchronously if the processing in the clustered environment includes an overhead that is greater than the network overhead.

If the amount of the intermediate data set is small, the join can be executed using the in-memory SQL engine. In some embodiments, sufficient speed for real-time execution can be realized by executing SQL after batch insertion into the RAM disk table.

In some embodiments, the join fetches each intermediate result and then applies the appropriate join algorithm to generate a result object (1170), resulting in two data sources at the result joiner (1165). The query results may be returned to the user interface layer (1105). In the result combiner (1165), each intermediate result is obtained and applied with an appropriate join algorithm to generate a result object (1170) for return through the user interface layer (1105), so that the join It may be performed across two data sources.

Many join strategies can be used to merge the results. In some embodiments, an in-memory SQL engine may be used. For example, the intermediate data set may be written to a POSTGRESQL or MYSQL table on a RAM disk (which may reside on another server on the same network). A SQL join query (which may be used with a calculation) may then be executed on the database to generate the final result set. The final result set may be sent back to the result joiner (1165). Similarly, other in-memory SQL engines (eg, MemSQL) may be used. The infrastructure running the in-memory server may have a large amount of memory. Similarly, other columnar databases may be used. When a complex query that is expensive or the performance of the target data source is poor, write to a columnar database with excellent query performance to provide comfortable performance to users.

In other embodiments, it may be desirable to do as much processing as possible on the native data source. By pushing down search, comparison, local join, sorting, aggregation, and grouping to a lower data source, you can take advantage of the data source's capabilities, transferred over the network, and processed by an in-memory engine The amount of intermediate data can be limited.

As another example, data query caching may be used. For example, a cache cluster (eg, REDIS, MEMCACHED) may be configured to store the results for each query. You may check for any results in the cache before querying any store. Only when the result is not in the cache is the data source accessed. Performance can be improved when queries are frequently executed against sources with high latency. For engines that do not have their own caching (for example, IMPALA) or compound queries (for example, if the query is split into multiple physical queries and the results are merged by an in-memory SQL engine), the performance improvement is significant. There is a possibility.

In other embodiments, concurrent processing may be performed. For example, if a query spans many data sources, parallel processing may reduce query execution time. Further, in some embodiments, mutually exclusive queries may be identified and executed in parallel threads during query execution plan determination.

In other embodiments, distributed processing may be performed. For example, the native data source's distributed processing capabilities may be used inherently while executing queries using specific engines (eg, HIVE, IMPALA) on the Hadoop platform or Spark platform. However, in some implementations, other ecosystem tools may be required to decentrally join large intermediate sets on clustered engines such as Spark and Hadoop.

In addition, non-real time optimal query plan generation may be used. For example, the query engine (1110) can record the execution time of each stage in the query execution workflow. This log may include query processing for specific data model data sources, network forwarding and data merging. This log is useful as data for further optimizing the query plan for subsequent query workflow execution. To avoid the overhead of searching the optimal execution graph, this optimization process runs as a background process that determines the optimal execution plan and caches it for reuse in the internal metadata store (1130). Also good.

(Example of computing environment)
FIG. 12 illustrates an example computing environment (1200) that includes a computing device (1205) suitable for implementing a particular embodiment. The computing device (1205) in the computing environment (1200) may include one or more processing units, cores, or processors (1210), memory (1215) (eg, RAM, ROM, etc.), internal storage (1220). (Eg, magnetic disk, optical disk, semiconductor storage, organic storage), I / O interface (1225), which are connected on a communication mechanism or bus (1230) for the exchange of information. It may be embedded in a computing device (1205).

The computing device (1205) may be communicatively connected to an input user interface (1235) and an output device interface (1240). Either or both of the input user interface (1235) and the output device interface (1240) may be wired or wireless and may be removable. The input user interface (1235) includes any device, component, or sensor that can provide input, whether physical or virtual (eg, buttons, touch screen interface, keyboard, pointer, cursor control, microphone, braille) , Motion sensor, optical reader, etc.). The output device interface (1240) includes a display, a television, a monitor, a printer, a speaker, Braille, and the like. In some embodiments, the input user interface (1235) and the output device interface (1240) may be embedded in or physically connected to the computing device (1205). In other embodiments, other computing devices may provide the functionality of the input user interface (1235) and output device interface (1240) of the computing device (1205).

The computing device (1205) is a mobile device (eg, tablet, notebook, laptop, personal computer, portable TV, radio, etc.), a highly portable device (eg, smartphone, vehicle and other mechanical devices) , Devices carried by humans and animals), computers (eg, desktop computers, server devices, other computers, information kiosks, televisions with one or more processors embedded therein, and / or televisions coupled to them, radios Etc.), but is not limited to this.

The computing device (1205) may communicate with any number of network components, devices, and systems, including one or more homogeneous or heterogeneous computing devices, external storage (1245) and network ( 1250) (eg, via an I / O interface (1225)). The computing device (1205), or any connected computing device, functions as a server, client, thin server, general machine, special purpose machine, or other termed computer. , Can provide services or refer to them.

The I / O interface (1225) is any communication or I / O protocol or standard (eg, Ethernet) for a network for communicating information between connected components, devices, and networks in the computer environment 900. (Registered trademark), 802.11x, universal system bus, WiMAX, modem, cellular network protocol, and the like), but not limited thereto. The network (950) may be any network (eg, the Internet, a local area network, an area network, a telephone network, a cellular network, a satellite network) or a combination thereof.

The computing device (1205) may be used and communicated using computer usable or computer readable media including temporary and non-temporary media. Temporary media include transmission media (eg, metal cables, optical fibers), signals, carrier waves, and the like. Non-transitory media include magnetic media (eg, disks and tapes), optical media (eg, CD-ROM, digital video disc, Blu-ray disc), semiconductor media (eg, RAM, ROM, flash memory, solid state storage) , And other non-volatile storage devices or memories.

The computing device (1205) may be used to implement techniques, methods, applications, processes, or computer-executable instructions in some computing environment embodiments. Computer-executable instructions can be retrieved from temporary media, stored on non-transitory media, and retrieved therefrom. Executable instructions may consist of one or more of programming, scripting, and machine language (eg, C, C ++, C #, Java, Visual Basic, Python, Perl, JavaScript, etc.).

The processor (1210) may execute under any operating system (OS) (not shown) in a native or virtual environment. Logical unit (1255), API (application programming interface) unit (1260), input unit (1265), output unit (1270), heuristic index generation unit (1275), similarity determination unit (1280), enterprise data graph The generation unit (1285), the analytic graphics unit (1290), and the inter-unit communication mechanism (1295) communicate with the OS and other applications (not shown) to communicate with each other. For example, the heuristic index generation unit (1275), the similarity determination unit (1280), the enterprise data graph generation unit (1285), and the analytic graphics unit (1290) may include one or more of the ones shown in FIGS. Processing can be implemented. The described units and elements can be changed in design, function, configuration, or implementation, and are not limited to the content of the description.

In some embodiments, the API unit (1260) receives one or more execution instructions and receives one or more other units (eg, heuristic index generation unit (1275), similarity determination unit (1280), enterprise May communicate with a data graph generation unit (1285), an analysis graphics unit (1290), and a query unit (1297), for example, a heuristic index is generated via a heuristic index generation unit (1275); When done, a heuristic index may be provided to the similarity determination unit (1280) to determine the similarity between the different entities. 285) may generate the enterprise data graph, the similarity determination unit (1280) may provide the similarity data, and the enterprise data graph generation unit (1285) may further analyze the graphics (1290). The data analysis graphic may be generated and displayed using the output unit (1270).

In some embodiments, the logical unit (1255) controls the information flow between the units and includes an API unit (1260), an input unit (1265), an output unit (1270), a heuristic index generation unit (1275), A similarity determination unit (1280), an enterprise data graph generation unit (1285), and an analysis graphics unit (1290) are included. For example, the flow or implementation of one or more processes can be controlled by the logical unit (1255) alone or in conjunction with the API unit (1260). The query unit (1297) executes a query in each data repository in cooperation with the API unit (1260), the logical unit (1255), and the similarity determination unit (980).

Several embodiments have been described and are provided to convey the subject matter of the present invention to those skilled in the art. It should be noted that the subject matter of the present invention is not limited to the described embodiments and can be realized in various forms. The subject matter of the present invention may be implemented without the components defined or described herein, or may be implemented with other components not described herein. It will be readily apparent to those skilled in the art that modifications may be made to the embodiments without departing from the subject matter of the invention as defined by the claims.

Claims

A method for analyzing data in multiple data stores,
Removing duplicates from a set of strings in a plurality of fields in a table in the plurality of data stores;
Storing the duplicated string in an inverted index;
Determining similarity between character strings stored in the transposed index;
Determining similarity between the plurality of fields based on the determined similarity between character strings;
Generating a graph-format data in which a similarity relationship is expressed by an edge using a table including fields determined to have high similarity between the plurality of fields as a node.
The step of determining the similarity between the character strings stored in the inverted index further includes:
Applying morphological analysis to character strings in the plurality of fields and dividing the character strings;
Obtaining a cosine similarity between the strings;
Applying a logistic function to the cosine similarity.
The step of determining the similarity between the character strings stored in the inverted index further includes:
Storing a set of strings in the plurality of fields in a temporary table;
Applying a natural join operation to the table;
The method of claim 1, comprising calculating a similarity between the tables.
Further comprising determining similarity between the fields based on attributes of data in a plurality of fields in a table in the plurality of data stores;
The said attribute is any one or more of a density | concentration, the number of individual values, the boundary of a histogram, the number of null values, or the number of non-null values, The Claim 1, Claim 2, or Claim 3 Method.
A computer-implemented method using the graphical data generated by the method of claim 1, claim 2, claim 3 or claim 4, comprising:
Receiving a query for a first field of a first table in a first data store;
Identifying a second field similar to the first field based on the graph-format data;
Displaying at least one of the second field, a second table including the second field, and a second data store including the second table.
A computer-implemented method using the graphical data generated by the method of claim 1, claim 2, claim 3 or claim 4, comprising:
Displaying information about the first data store;
Receiving a query for a first field of a first table in the first data store;
Identifying a second field similar to the first field based on the graph-format data;
Displaying information about a second data store including the second field;
Updating the display of information about the first data store and the display of information about the second data store in parallel in response to the query.
A computer-implemented method using the graphical data generated by the method of claim 1, claim 2, claim 3 or claim 4, comprising:
Receiving a query for a first field of a first table in a first data store;
Identifying a second field similar to the first field based on the graph-format data;
Combining the result of the query for the first data store and the result of the query for a second data store including a second table including the second field on the same screen. Method.
A computer-implemented method using the transposed index created by the method of claim 1, claim 2, claim 3, or claim 4, comprising:
Receiving keywords from users,
Searching a table including a field including the keyword from the transposed index;
Displaying the field or the table.