WO2023164294A1 - Query splitter for an inverted index datastore - Google Patents

Query splitter for an inverted index datastore Download PDF

Info

Publication number
WO2023164294A1
WO2023164294A1 PCT/US2023/014141 US2023014141W WO2023164294A1 WO 2023164294 A1 WO2023164294 A1 WO 2023164294A1 US 2023014141 W US2023014141 W US 2023014141W WO 2023164294 A1 WO2023164294 A1 WO 2023164294A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
sql
node
columnar data
abstract syntax
Prior art date
Application number
PCT/US2023/014141
Other languages
French (fr)
Inventor
Anup BADHE
Gautam MARYA
Alexander PINZON
Original Assignee
Kinesso, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kinesso, LLC filed Critical Kinesso, LLC
Publication of WO2023164294A1 publication Critical patent/WO2023164294A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/221Column-oriented storage; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • G06F16/24526Internal representations for queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists

Definitions

  • An inverted index datastore is used by many applications that need fast counts based on complex Boolean queries consisting of textual or numerical data present in a document or series of documents. It stores a mapping of words (or any other type of search terms) to their locations in the database table or document.
  • the inverted index is a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents. As such, it is the opposite of a forward index, which maps from documents to content.
  • the inverted index data structure allows efficient, full-text searches in a database. It is an important part of information retrieval systems and search engines. Although an inverted index allows fast full-text searches, it does so at a cost of increased processing when a document is added to the database since the indexes must be updated as each document is added.
  • an inverted index datastore may be used to get fast counts based on textual and numerical values present in a document or documents, but the application may also require support for extraction of data based on those fast counts, for which an inverted index data structure is not suitable.
  • the extraction process can take a large amount of time based on the volume of documents indexed and the complexity of the query. It would be desirable to develop a method and system to extract the data in a shorter time but still preserve the semantics of the query.
  • the fastest data extraction mechanism currently available uses Apache Spark and requires columnar data that is organized and indexed properly. Therefore, a solution allowing conversion from inverted index query (e.g., JavaScript Object Notation or JSON format) to a columnar data query (e.g., Structured Query Language or SQL) format would be desirable.
  • the present invention is directed to a method and system to convert an inverted index query (such as, but not limited to, a JavaScript Object Notation or JSON query) based on inverted index data structure to an equivalent flattened columnar data query (such as, but not limited to, Structured Query Language or SQL query), and execute it over a cluster (such as, but not limited to, Apache Spark) in order to extract large volumes of data in the fastest amount of time.
  • an inverted index query such as, but not limited to, a JavaScript Object Notation or JSON query
  • an equivalent flattened columnar data query such as, but not limited to, Structured Query Language or SQL query
  • an application user may submit a JSON query request via an http endpoint.
  • the JSON query is converted to an Abstract Syntax Tree (AST), which is a tree representation of code.
  • AST Abstract Syntax Tree
  • Each node in the AST is converted to a SQL operator or function using a conversion table between JSON elements and SQL elements.
  • a SPARK-SQL WHERE clause may be generated using a pre-order traversal algorithm operated over the AST.
  • a SPARK-SQL SELECT clause may be created using the required columns specified by the user.
  • a temporal view may be created in SPARK from the input data specified by the user.
  • the SPARK SQL context may be used to process the SPARK-SQL query that is thus generated. The results of the execution of the previous query are then saved in the output path specified by the user.
  • the processing just described can support many applications. These include exporting of data to a data warehouse; media planning; syndication; audience composition and profiling reports; and look-alike modeling.
  • Fig. 1 is a flow chart for a method according to an embodiment of the present invention.
  • Fig. 2 is an example of columnar data representation according to an embodiment of the present invention.
  • Fig. 3 is an example of joined data representation according to an embodiment of the present invention.
  • Fig. 4 illustrates the conversion of a JSON query into an Abstract Syntax
  • Fig. 5 illustrates the conversion of the nodes in an Abstract Syntax Tree into SQL operations or functions using a conversion table.
  • Fig. 6 illustrates a SPARK-SQL query in tree structure form according to an embodiment of the present invention.
  • the data pertains to colors used by the teams of various cities participating in various sports.
  • Fig. 2 provides a key 30 for this data, using a columnar data form of representation, while Fig. 3 shows the joined data representation 32 including the four different users (Carla, Julia, Joseph, and David) and their corresponding color/city/sport from the code provided above.
  • An example JSON query shown at step 10 in Fig. 1 , may be constructed for this inverted index, as defined in the code below:
  • “must” refers to terms that must appear in matching documents, analogous to an AND in Boolean logic, whereas “should” refers to terms where at least one must match, analogous to an OR in Boolean logic.
  • the example JSON query is automatically restructured by the system implementing this method as an Abstract Syntax Tree (AST).
  • AST is a tree representation of text, source code, or other similar source materials in a formal tree structure, with each node of the tree denoting a construct occurring in the source text.
  • An example is provided in the illustration of Fig. 4, where a JSON query 34 is shown on the left and the AST 36 that will be constructed from JSON query 34 is shown on the right, the bool query itself becomes the top-level node in the tree, denoted as Qb 38.
  • the two operators “must” and “should” are rendered as child nodes from Qb 38, denoted as Qm 40 and Q s 42, respectively.
  • each of the two terms for each of the two operators are rendered as child nodes for the corresponding operator node, with four total nodes being constructed at this level of the tree.
  • the terms “Sport”:4 and “City”:4 are child nodes of Q m 40 as q1 44 and q2 46, respectively
  • the terms “Color”:6 and “Color”:7 are child nodes of Q s 42 as q3 48 and q4 50, respectively.
  • each node in AST 36 is converted by the system to a SQL operator in function.
  • SPARK-SQL is used, although the invention is not so limited.
  • Table 1 may be used as a conversion table in this example: Table 1
  • Qm 40 is converted to SPARK-SQL operator AND 56 q3 48 is converted to SPARK-SQL function array_contains(Color, 6) 64 q4 50 is converted to SPARK-SQL function array_contains(Color, 7) 66
  • the result of this conversion may now be used by the system to automatically generate a node list with post-order tree traversal, i.e., LRN order.
  • a SPARK-SQL WHERE clause is generated from this process.
  • the ordered output of that process will be: q1 , q2, Q m , q3, q4, Q s , Qb
  • the final SPARK-SQL query may be automatically generated by the system using a recursive function from a SPARK-SQL SELECT clause as follows:
  • the resulting SPARK-SQL query representing in a tree-structure SQL query 68, shown in Fig. 6, represents the same JSON query 34 from Fig. 4.
  • the top-level node is and 70, corresponding to AND 54 from SPARK-SQL tree 52.
  • Second-level nodes and 72 and or 74 correspond to AND 56 and OR 58, respectively.
  • step 20 the system automatically estimates the size of the computing cluster necessary in order to process the query based on the columns used in the previous steps.
  • this cluster is created, and at step 24 the processing of the query begins by reading the data from the input location.
  • the SQL query is executed at the cluster for this input data, and then at step 28 the results of the processing are output to the location indicated by the user.
  • the user output from the method just described may be useful in a number of applications.
  • data extracted by the system following this method can be used to load the data into a warehouse for analysis and audience segmentation using SQL. Segmentation is a process whereby records are divided into distinct groups with similar characteristics.
  • the data extracted by the system can be used to load an audience segment into a media planning system. For instance, it can be used for design and planning of messaging using tools centered around channel selection, reach customization, and optimization.
  • Another application is syndication, whereby content is republished at different locations in order to expand an audience size for the messaging. In order to facilitate syndication, internal lists of message recipients must be translated into standardized formats accessible by third parties who will perform the syndication.
  • This syndication process relies heavily on the ability to gather and deliver long lists of identifiers (potentially from tens of millions even up to billions of identifiers) based on a set of attributes that describe the group of message recipients.
  • the syndication system oversees translating and delivering lists of potential recipients to multiple online platforms, and so it may rely on the system performing the process herein described in order to gather the list of internal identifiers of potential recipients that comply with a set of logical clauses.
  • the system as described herein may be used to generate various types of reports.
  • the extracted data can be used to generate an audience composition report that provides information on the characteristics of a specific group or audience.
  • This report may include data on demographic factors as well as psychographic factors pertaining to an audience.
  • the extracted data can be used to generate an audience profiling report that provides a detailed overview of a specific audience that is the intended target audience of a message or series of messages.
  • This report typically includes a variety of information, such as demographic characteristics, psychographic traits, and habits. The report may be used to better understand an audience and create more effective messaging tailored to the specific needs and interests of that audience.
  • look-alike modeling a group of recipients who share similar characteristics to an existing target audience is identified, and then that information is used to find new, potential message recipients. In this way, lookalike modeling may expand the audience of a message or series of messages.
  • the systems and methods described herein may in various embodiments be implemented by any combination of hardware and software.
  • the systems and methods may be implemented by a computer system or a collection of computer systems, each of which includes one or more processors executing program instructions stored on a computer-readable storage medium coupled to the processors.
  • the program instructions may implement the functionality described herein.
  • the various systems and displays as illustrated in the figures and described herein represent example implementations. The order of any method may be changed, and various elements may be added, modified, or omitted.
  • a computing system or computing device as described herein may implement a hardware portion of a cloud computing system or non-cloud computing system, as forming parts of the various implementations of the present invention.
  • the computer system may be any of various types of devices, including, but not limited to, a commodity server, personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, handheld computer, workstation, network computer, a consumer device, application server, storage device, telephone, mobile telephone, or in general any type of computing node, compute node, compute device, and/or computing device.
  • the computing system includes one or more processors (any of which may include multiple processing cores, which may be single or multi-threaded) coupled to a system memory via an input/output (I/O) interface.
  • the computer system further may include a network interface coupled to the I/O interface.
  • the computer system may be a single processor system including one processor, or a multiprocessor system including multiple processors.
  • the processors may be any suitable processors capable of executing computing instructions. For example, in various embodiments, they may be general-purpose or embedded processors implementing any of a variety of instruction set architectures. In multiprocessor systems, each of the processors may commonly, but not necessarily, implement the same instruction set.
  • the computer system also includes one or more network communication devices (e.g., a network interface) for communicating with other systems and/or components over a communications network, such as a local area network, wide area network, or the Internet.
  • a client application executing on the computing device may use a network interface to communicate with a server application executing on a single server or on a cluster of servers that implement one or more of the components of the systems described herein in a cloud computing or non-cloud computing environment as implemented in various subsystems.
  • a server application executing on a computer system may use a network interface to communicate with other instances of an application that may be implemented on other computer systems.
  • the computing device also includes one or more persistent storage devices and/or one or more I/O devices.
  • the persistent storage devices may correspond to disk drives, tape drives, solid state memory, other mass storage devices, or any other persistent storage devices.
  • the computer system (or a distributed application or operating system operating thereon) may store instructions and/or data in persistent storage devices, as desired, and may retrieve the stored instruction and/or data as needed.
  • the computer system may implement one or more nodes of a control plane or control system, and persistent storage may include the SSDs attached to that server node. Multiple computer systems may share the same persistent storage devices or may share a pool of persistent storage devices, with the devices in the pool representing the same or different storage technologies.
  • the computer system includes one or more system memories that may store code/instructions and data accessible by the processor(s).
  • the system’s memory capabilities may include multiple levels of memory and memory caches in a system designed to swap information in memories based on access speed, for example.
  • the interleaving and swapping may extend to persistent storage in a virtual memory implementation.
  • the technologies used to implement the memories may include, by way of example, static random-access memory (RAM), dynamic RAM, read-only memory (ROM), non-volatile memory, or flashtype memory.
  • RAM static random-access memory
  • ROM read-only memory
  • flashtype memory non-volatile memory
  • multiple computer systems may share the same system memories or may share a pool of system memories.
  • System memory or memories may contain program instructions that are executable by the processor(s) to implement the routines described herein.
  • program instructions may be encoded in binary, Assembly language, any interpreted language such as Java, compiled languages such as C/C++, or in any combination thereof; the particular languages given here are only examples.
  • program instructions may implement multiple separate clients, server nodes, and/or other components.
  • program instructions may include instructions executable to implement an operating system (not shown), which may be any of various operating systems, such as UNIX, LINUX, SolarisTM, MacOSTM, or Microsoft WindowsTM. Any or all of program instructions may be provided as a computer program product, or software, that may include a non-transitory computer-readable storage medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to various implementations.
  • a non-transitory computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer).
  • a non-transitory computer- accessible medium may include computer-readable storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM coupled to the computer system via the I/O interface.
  • a non-transitory computer-readable storage medium may also include any volatile or non-volatile media such as RAM or ROM that may be included in some embodiments of the computer system as system memory or another type of memory.
  • program instructions may be communicated using optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.) conveyed via a communication medium such as a network and/or a wired or wireless link, such as may be implemented via a network interface.
  • a network interface may be used to interface with other devices, which may include other computer systems or any type of external electronic device.
  • system memory, persistent storage, and/or remote storage accessible on other devices through a network may store data blocks, replicas of data blocks, metadata associated with data blocks and/or their state, database configuration information, and/or any other information usable in implementing the routines described herein.
  • the I/O interface may coordinate I/O traffic between processors, system memory, and any peripheral devices in the system, including through a network interface or other peripheral interfaces.
  • the I/O interface may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory) into a format suitable for use by another component (e.g., processors).
  • the I/O interface may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example.
  • PCI Peripheral Component Interconnect
  • USB Universal Serial Bus
  • some or all of the functionality of the I/O interface such as an interface to system memory, may be incorporated directly into the processor(s).
  • a network interface may allow data to be exchanged between a computer system and other devices attached to a network, such as other computer systems (which may implement one or more storage system server nodes, primary nodes, read-only node nodes, and/or clients of the database systems described herein), for example.
  • the I/O interface may allow communication between the computer system and various I/O devices and/or remote storage.
  • Input/output devices may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer systems.
  • the user interfaces described herein may be visible to a user using various types of display screens, which may include CRT displays, LCD displays, LED displays, and other display technologies.
  • the inputs may be received through the displays using touchscreen technologies, and in other implementations the inputs may be received through a keyboard, mouse, touchpad, or other input technologies, or any combination of these technologies.
  • similar input/output devices may be separate from the computer system and may interact with one or more nodes of a distributed system that includes the computer system through a wired or wireless connection, such as over a network interface.
  • the network interface may commonly support one or more wireless networking protocols (e.g., Wi-Fi/IEEE 802.11 , or another wireless networking standard).
  • the network interface may support communication via any suitable wired or wireless general data networks, such as other types of Ethernet networks, for example.
  • the network interface may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
  • a read-write node and/or readonly nodes within the database tier of a database system may present database services and/or other types of data storage services that employ the distributed storage systems described herein to clients as network-based services.
  • a network-based service may be implemented by a software and/or hardware system designed to support interoperable machine-to-machine interaction over a network.
  • a web service may have an interface described in a machine-processable format, such as the Web Services Description Language (WSDL).
  • WSDL Web Services Description Language
  • Other systems may interact with the network-based service in a manner prescribed by the description of the network-based service’s interface.
  • the network-based service may define various operations that other systems may invoke, and may define a particular application programming interface (API) to which other systems may be expected to conform when requesting the various operations.
  • API application programming interface
  • a network-based service may be requested or invoked through the use of a message that includes parameters and/or data associated with the network-based services request.
  • a message may be formatted according to a particular markup language such as Extensible Markup Language (XML), and/or may be encapsulated using a protocol such as Simple Object Access Protocol (SOAP).
  • SOAP Simple Object Access Protocol
  • a network-based services client may assemble a message including the request and convey the message to an addressable endpoint (e.g., a Uniform Resource Locator (URL)) corresponding to the web service, using an Internet-based application layer transfer protocol such as Hypertext Transfer Protocol (HTTP).
  • URL Uniform Resource Locator
  • HTTP Hypertext Transfer Protocol
  • network-based services may be implemented using Representational State Transfer (REST) techniques rather than message-based techniques.
  • REST Representational State Transfer
  • a network-based service implemented according to a REST technique may be invoked through parameters included within an HTTP method such as PUT, GET, or DELETE.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An inverted index query is converted to an equivalent columnar data query and executed over a compute cluster in order to extract large volumes of data in the fastest amount of time. Specifically, a user's JSON query may be converted to an Abstract Syntax Tree (AST). Each node in the AST is converted to a SQL operator or function using a conversion table between JSON elements and SQL elements. A node list is generated with post-order tree traversal, and the final SQL query is generated with a recursive function. A cluster size is determined, and the compute cluster is created to perform processing using the SQL query. Input data is read, the query is executed for the input data, and the results are exported to the output path specified by the user.

Description

QUERY SPLITTER FOR AN INVERTED INDEX DATASTORE CROSS-REFERENCE TO RELATED APPLICATIONS
[0001 ] This application claims the benefit of U.S. provisional patent application no. 63/314,704, filed on February 28, 2022. Such application is incorporated herein by reference in its entirety.
BACKGROUND
[0002] An inverted index datastore is used by many applications that need fast counts based on complex Boolean queries consisting of textual or numerical data present in a document or series of documents. It stores a mapping of words (or any other type of search terms) to their locations in the database table or document. The inverted index is a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents. As such, it is the opposite of a forward index, which maps from documents to content. The inverted index data structure allows efficient, full-text searches in a database. It is an important part of information retrieval systems and search engines. Although an inverted index allows fast full-text searches, it does so at a cost of increased processing when a document is added to the database since the indexes must be updated as each document is added.
[0003] In certain data processing applications, an inverted index datastore may be used to get fast counts based on textual and numerical values present in a document or documents, but the application may also require support for extraction of data based on those fast counts, for which an inverted index data structure is not suitable. The extraction process can take a large amount of time based on the volume of documents indexed and the complexity of the query. It would be desirable to develop a method and system to extract the data in a shorter time but still preserve the semantics of the query. The fastest data extraction mechanism currently available uses Apache Spark and requires columnar data that is organized and indexed properly. Therefore, a solution allowing conversion from inverted index query (e.g., JavaScript Object Notation or JSON format) to a columnar data query (e.g., Structured Query Language or SQL) format would be desirable.
[0004] Any references mentioned in this background section are not admitted to be prior art with respect to the present invention.
SUMMARY
[0005] The present invention is directed to a method and system to convert an inverted index query (such as, but not limited to, a JavaScript Object Notation or JSON query) based on inverted index data structure to an equivalent flattened columnar data query (such as, but not limited to, Structured Query Language or SQL query), and execute it over a cluster (such as, but not limited to, Apache Spark) in order to extract large volumes of data in the fastest amount of time.
[0006] For example, in certain implementations, an application user may submit a JSON query request via an http endpoint. The JSON query is converted to an Abstract Syntax Tree (AST), which is a tree representation of code. Next, based on the complexity of the query, a cluster size is computed on which the processing will take place. Each node in the AST is converted to a SQL operator or function using a conversion table between JSON elements and SQL elements. A SPARK-SQL WHERE clause may be generated using a pre-order traversal algorithm operated over the AST. A SPARK-SQL SELECT clause may be created using the required columns specified by the user. A temporal view may be created in SPARK from the input data specified by the user. The SPARK SQL context may be used to process the SPARK-SQL query that is thus generated. The results of the execution of the previous query are then saved in the output path specified by the user.
[0007] The processing just described can support many applications. These include exporting of data to a data warehouse; media planning; syndication; audience composition and profiling reports; and look-alike modeling.
[0008] These and other features, objects and advantages of the present invention will become better understood from a consideration of the following detailed description of the preferred embodiments and appended claims in conjunction with the drawings as described following:
DRAWINGS
[0009] Fig. 1 is a flow chart for a method according to an embodiment of the present invention.
[0010] Fig. 2 is an example of columnar data representation according to an embodiment of the present invention.
[0011 ] Fig. 3 is an example of joined data representation according to an embodiment of the present invention.
[0012] Fig. 4 illustrates the conversion of a JSON query into an Abstract Syntax
Tree according to an embodiment of the present invention. [0013] Fig. 5 illustrates the conversion of the nodes in an Abstract Syntax Tree into SQL operations or functions using a conversion table.
[0014] Fig. 6 illustrates a SPARK-SQL query in tree structure form according to an embodiment of the present invention.
DETAILED DESCRIPTION OF CERTAIN EMBODIMENTS
[0015] Before the present invention is described in further detail, it should be understood that the invention is not limited to the particular embodiments described, and that the terms used in describing the particular embodiments are for the purpose of describing those particular embodiments only, and are not intended to be limiting, since the scope of the present invention will be limited only by the claims.
[0016] An embodiment of the invention may be illustrated with reference to a particular simplified example, in conjunction with the drawings as described herein. This example will use a representation of inverted index documents as defined in the code below:
{
"_index": "User_records", "_id": "yudC4H0Bc8n2URPxNYma", "_source": {"User": "Carla", "Color": [9, 4, 6, 2], "City": [6, 2, 9, 3], "Sport": [2, 4, 1 ,6]}
},
{
"_index": "User_records", "_id": "y-dC4H0Bc8n2URPxT4kA", "_source": {"User": "Julia", "Color": [1 ,5, 8, 7], "City": [8, 9, 3, 4], "Sport":
[7, 3, 2, 6]}
},
{
"_index": "User_records", "_id": "zOdC4HOBc8n2URPxZokD", "_source": {"User": "Joseph", "Color": [1 ,5, 8, 3], "City": [9,5,1 ,11 ], "Sport": [3, 4, 1 ,5]}
},
{
"_index": "User_records", "_id": "zedC4H0Bc8n2URPxcYng",
"_source": {"User": "David", "Color": [2,10,9,4], "City": [6,10,4,7], "Sport": [7, 3,6, 5]}
}
In this simple example, the data pertains to colors used by the teams of various cities participating in various sports. Fig. 2 provides a key 30 for this data, using a columnar data form of representation, while Fig. 3 shows the joined data representation 32 including the four different users (Carla, Julia, Joseph, and David) and their corresponding color/city/sport from the code provided above.
An example JSON query, shown at step 10 in Fig. 1 , may be constructed for this inverted index, as defined in the code below:
{ bool": { 'must1': [
{"term": {"Sport": 4}},
{"term": {"City": 9}}
"should": [
{"term": {"Color": 6}}, {"term": {"Color": 7}}
}
}
In the bool query, “must” refers to terms that must appear in matching documents, analogous to an AND in Boolean logic, whereas “should” refers to terms where at least one must match, analogous to an OR in Boolean logic.
[0017] At step 12 of Fig. 1 , the example JSON query is automatically restructured by the system implementing this method as an Abstract Syntax Tree (AST). An AST is a tree representation of text, source code, or other similar source materials in a formal tree structure, with each node of the tree denoting a construct occurring in the source text. An example is provided in the illustration of Fig. 4, where a JSON query 34 is shown on the left and the AST 36 that will be constructed from JSON query 34 is shown on the right, the bool query itself becomes the top-level node in the tree, denoted as Qb 38. The two operators “must” and “should” are rendered as child nodes from Qb 38, denoted as Qm 40 and Qs 42, respectively. Further down, each of the two terms for each of the two operators are rendered as child nodes for the corresponding operator node, with four total nodes being constructed at this level of the tree. Thus the terms “Sport”:4 and “City”:4 are child nodes of Qm 40 as q1 44 and q2 46, respectively, and the terms “Color”:6 and “Color”:7 are child nodes of Qs 42 as q3 48 and q4 50, respectively. In table form then, the nodes in the AST may be rendered as follows: q1 = {"term": {"Sport": 4}} q2 = {"term": {"City": 9}} q3 = {"term": {"Color": 6}} q4 = {"term": {"Color": 7}}
Qm = must
Qs = should
Qb = bool
[0018] Next, at step 14 of Fig. 1 , each node in AST 36 is converted by the system to a SQL operator in function. In this example, SPARK-SQL is used, although the invention is not so limited. Table 1 may be used as a conversion table in this example:
Figure imgf000009_0001
Table 1
Using exemplary AST 36 from Fig. 4, the following results are then obtained by applying this conversion, as shown in Fig. 5 for SPARK-SQL tree 52: q1 44 is converted to SPARK-SQL function array_contains(Sport, 4) 60 q2 46 is converted to SPARK-SQL function array_contains(City, 9) 62
Qm 40 is converted to SPARK-SQL operator AND 56 q3 48 is converted to SPARK-SQL function array_contains(Color, 6) 64 q4 50 is converted to SPARK-SQL function array_contains(Color, 7) 66
Qs 42 is converted to SPARK-SQL operator OR 58
Qb 38 is converted to SPARK-SQL operator AND 54
[0019] At step 16 of Fig. 1 , the result of this conversion may now be used by the system to automatically generate a node list with post-order tree traversal, i.e., LRN order. A SPARK-SQL WHERE clause is generated from this process. In this example, the ordered output of that process will be: q1 , q2, Qm, q3, q4, Qs, Qb
[0020] At step 18 of Fig. 1 , the final SPARK-SQL query may be automatically generated by the system using a recursive function from a SPARK-SQL SELECT clause as follows:
SELECT User
FROM User_records
WHERE
(array_contains(Sport, 4) AND array_contains(City, 9))
AND (array_contains(Color, 6) OR array_contains(Color, 7))
The resulting SPARK-SQL query representing in a tree-structure SQL query 68, shown in Fig. 6, represents the same JSON query 34 from Fig. 4. In treestructure query 68, the top-level node is and 70, corresponding to AND 54 from SPARK-SQL tree 52. Second-level nodes and 72 and or 74 correspond to AND 56 and OR 58, respectively. Likewise, third-level nodes q1 Sport =4 76, q2 City = 9 78, q3 Color = 6 80, and q4 Color = 7 82, correspond to q1 array_contains (Sport, 4) 60, q2 array_contains (City, 9) 62, q3 array_contains (Color, 6) 64, and q4 array_contains (Color, 7) 66, respectively.
[0021 ] Next, at step 20 as shown in Fig. 1 , the system automatically estimates the size of the computing cluster necessary in order to process the query based on the columns used in the previous steps. At step 22, this cluster is created, and at step 24 the processing of the query begins by reading the data from the input location. At step 26, the SQL query is executed at the cluster for this input data, and then at step 28 the results of the processing are output to the location indicated by the user.
[0022] The user output from the method just described may be useful in a number of applications. For example, data extracted by the system following this method can be used to load the data into a warehouse for analysis and audience segmentation using SQL. Segmentation is a process whereby records are divided into distinct groups with similar characteristics. In addition, the data extracted by the system can be used to load an audience segment into a media planning system. For instance, it can be used for design and planning of messaging using tools centered around channel selection, reach customization, and optimization. Another application is syndication, whereby content is republished at different locations in order to expand an audience size for the messaging. In order to facilitate syndication, internal lists of message recipients must be translated into standardized formats accessible by third parties who will perform the syndication. Because multiple messaging campaigns may be run at the same time for different users and each with a different intended audience, it may be necessary to translate and deliver multiple independent lists of such references. This syndication process relies heavily on the ability to gather and deliver long lists of identifiers (potentially from tens of millions even up to billions of identifiers) based on a set of attributes that describe the group of message recipients. The syndication system oversees translating and delivering lists of potential recipients to multiple online platforms, and so it may rely on the system performing the process herein described in order to gather the list of internal identifiers of potential recipients that comply with a set of logical clauses.
[0023] The system as described herein may be used to generate various types of reports. For example, the extracted data can be used to generate an audience composition report that provides information on the characteristics of a specific group or audience. This report may include data on demographic factors as well as psychographic factors pertaining to an audience. In addition, the extracted data can be used to generate an audience profiling report that provides a detailed overview of a specific audience that is the intended target audience of a message or series of messages. This report typically includes a variety of information, such as demographic characteristics, psychographic traits, and habits. The report may be used to better understand an audience and create more effective messaging tailored to the specific needs and interests of that audience.
[0024] Yet another application of the system performing the method described herein is look-alike modeling. In look-alike modeling, a group of recipients who share similar characteristics to an existing target audience is identified, and then that information is used to find new, potential message recipients. In this way, lookalike modeling may expand the audience of a message or series of messages.
[0025] The systems and methods described herein may in various embodiments be implemented by any combination of hardware and software. For example, in one embodiment, the systems and methods may be implemented by a computer system or a collection of computer systems, each of which includes one or more processors executing program instructions stored on a computer-readable storage medium coupled to the processors. The program instructions may implement the functionality described herein. The various systems and displays as illustrated in the figures and described herein represent example implementations. The order of any method may be changed, and various elements may be added, modified, or omitted.
[0026] A computing system or computing device as described herein may implement a hardware portion of a cloud computing system or non-cloud computing system, as forming parts of the various implementations of the present invention. The computer system may be any of various types of devices, including, but not limited to, a commodity server, personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, handheld computer, workstation, network computer, a consumer device, application server, storage device, telephone, mobile telephone, or in general any type of computing node, compute node, compute device, and/or computing device. The computing system includes one or more processors (any of which may include multiple processing cores, which may be single or multi-threaded) coupled to a system memory via an input/output (I/O) interface. The computer system further may include a network interface coupled to the I/O interface.
[0027] In various embodiments, the computer system may be a single processor system including one processor, or a multiprocessor system including multiple processors. The processors may be any suitable processors capable of executing computing instructions. For example, in various embodiments, they may be general-purpose or embedded processors implementing any of a variety of instruction set architectures. In multiprocessor systems, each of the processors may commonly, but not necessarily, implement the same instruction set. The computer system also includes one or more network communication devices (e.g., a network interface) for communicating with other systems and/or components over a communications network, such as a local area network, wide area network, or the Internet. For example, a client application executing on the computing device may use a network interface to communicate with a server application executing on a single server or on a cluster of servers that implement one or more of the components of the systems described herein in a cloud computing or non-cloud computing environment as implemented in various subsystems. In another example, an instance of a server application executing on a computer system may use a network interface to communicate with other instances of an application that may be implemented on other computer systems.
[0028] The computing device also includes one or more persistent storage devices and/or one or more I/O devices. In various embodiments, the persistent storage devices may correspond to disk drives, tape drives, solid state memory, other mass storage devices, or any other persistent storage devices. The computer system (or a distributed application or operating system operating thereon) may store instructions and/or data in persistent storage devices, as desired, and may retrieve the stored instruction and/or data as needed. For example, in some embodiments, the computer system may implement one or more nodes of a control plane or control system, and persistent storage may include the SSDs attached to that server node. Multiple computer systems may share the same persistent storage devices or may share a pool of persistent storage devices, with the devices in the pool representing the same or different storage technologies.
[0029] The computer system includes one or more system memories that may store code/instructions and data accessible by the processor(s). The system’s memory capabilities may include multiple levels of memory and memory caches in a system designed to swap information in memories based on access speed, for example. The interleaving and swapping may extend to persistent storage in a virtual memory implementation. The technologies used to implement the memories may include, by way of example, static random-access memory (RAM), dynamic RAM, read-only memory (ROM), non-volatile memory, or flashtype memory. As with persistent storage, multiple computer systems may share the same system memories or may share a pool of system memories. System memory or memories may contain program instructions that are executable by the processor(s) to implement the routines described herein. In various embodiments, program instructions may be encoded in binary, Assembly language, any interpreted language such as Java, compiled languages such as C/C++, or in any combination thereof; the particular languages given here are only examples. In some embodiments, program instructions may implement multiple separate clients, server nodes, and/or other components.
[0030] In some implementations, program instructions may include instructions executable to implement an operating system (not shown), which may be any of various operating systems, such as UNIX, LINUX, Solaris™, MacOS™, or Microsoft Windows™. Any or all of program instructions may be provided as a computer program product, or software, that may include a non-transitory computer-readable storage medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to various implementations. A non-transitory computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). Generally speaking, a non-transitory computer- accessible medium may include computer-readable storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM coupled to the computer system via the I/O interface. A non-transitory computer-readable storage medium may also include any volatile or non-volatile media such as RAM or ROM that may be included in some embodiments of the computer system as system memory or another type of memory. In other implementations, program instructions may be communicated using optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.) conveyed via a communication medium such as a network and/or a wired or wireless link, such as may be implemented via a network interface. A network interface may be used to interface with other devices, which may include other computer systems or any type of external electronic device. In general, system memory, persistent storage, and/or remote storage accessible on other devices through a network may store data blocks, replicas of data blocks, metadata associated with data blocks and/or their state, database configuration information, and/or any other information usable in implementing the routines described herein.
[0031 ] In certain implementations, the I/O interface may coordinate I/O traffic between processors, system memory, and any peripheral devices in the system, including through a network interface or other peripheral interfaces. In some embodiments, the I/O interface may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory) into a format suitable for use by another component (e.g., processors). In some embodiments, the I/O interface may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. Also, in some embodiments, some or all of the functionality of the I/O interface, such as an interface to system memory, may be incorporated directly into the processor(s).
[0032] A network interface may allow data to be exchanged between a computer system and other devices attached to a network, such as other computer systems (which may implement one or more storage system server nodes, primary nodes, read-only node nodes, and/or clients of the database systems described herein), for example. In addition, the I/O interface may allow communication between the computer system and various I/O devices and/or remote storage. Input/output devices may, in some embodiments, include one or more display terminals, keyboards, keypads, touchpads, scanning devices, voice or optical recognition devices, or any other devices suitable for entering or retrieving data by one or more computer systems. These may connect directly to a particular computer system or generally connect to multiple computer systems in a cloud computing environment, grid computing environment, or other system involving multiple computer systems. Multiple input/output devices may be present in communication with the computer system or may be distributed on various nodes of a distributed system that includes the computer system. The user interfaces described herein may be visible to a user using various types of display screens, which may include CRT displays, LCD displays, LED displays, and other display technologies. In some implementations, the inputs may be received through the displays using touchscreen technologies, and in other implementations the inputs may be received through a keyboard, mouse, touchpad, or other input technologies, or any combination of these technologies.
[0033] In some embodiments, similar input/output devices may be separate from the computer system and may interact with one or more nodes of a distributed system that includes the computer system through a wired or wireless connection, such as over a network interface. The network interface may commonly support one or more wireless networking protocols (e.g., Wi-Fi/IEEE 802.11 , or another wireless networking standard). The network interface may support communication via any suitable wired or wireless general data networks, such as other types of Ethernet networks, for example. Additionally, the network interface may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.
[0034] Any of the distributed system embodiments described herein, or any of their components, may be implemented as one or more network-based services in the cloud computing environment. For example, a read-write node and/or readonly nodes within the database tier of a database system may present database services and/or other types of data storage services that employ the distributed storage systems described herein to clients as network-based services. In some embodiments, a network-based service may be implemented by a software and/or hardware system designed to support interoperable machine-to-machine interaction over a network. A web service may have an interface described in a machine-processable format, such as the Web Services Description Language (WSDL). Other systems may interact with the network-based service in a manner prescribed by the description of the network-based service’s interface. For example, the network-based service may define various operations that other systems may invoke, and may define a particular application programming interface (API) to which other systems may be expected to conform when requesting the various operations.
[0035] In various embodiments, a network-based service may be requested or invoked through the use of a message that includes parameters and/or data associated with the network-based services request. Such a message may be formatted according to a particular markup language such as Extensible Markup Language (XML), and/or may be encapsulated using a protocol such as Simple Object Access Protocol (SOAP). To perform a network-based services request, a network-based services client may assemble a message including the request and convey the message to an addressable endpoint (e.g., a Uniform Resource Locator (URL)) corresponding to the web service, using an Internet-based application layer transfer protocol such as Hypertext Transfer Protocol (HTTP). In some embodiments, network-based services may be implemented using Representational State Transfer (REST) techniques rather than message-based techniques. For example, a network-based service implemented according to a REST technique may be invoked through parameters included within an HTTP method such as PUT, GET, or DELETE.
[0036] Unless otherwise stated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, a limited number of the exemplary methods and materials are described herein. It will be apparent to those skilled in the art that many more modifications are possible without departing from the inventive concepts herein.
[0037] All terms used herein should be interpreted in the broadest possible manner consistent with the context. When a grouping is used herein, all individual members of the group and all combinations and subcombinations possible of the group are intended to be individually included. When a range is stated herein, the range is intended to include all subranges and individual points within the range. All references cited herein are hereby incorporated by reference to the extent that there is no inconsistency with the disclosure of this specification.
[0038] The present invention has been described with reference to certain preferred and alternative embodiments that are intended to be exemplary only and not limiting to the full scope of the present invention, as set forth in the appended claims.

Claims

CLAIMS:
1 . A method for executing a query against a dataset, the method comprising the steps of: receiving an inverted index query, wherein the inverted index query comprises at least one operation, and wherein the at least one operation comprises at least one term; converting the inverted index query to an abstract syntax tree, wherein each of the at least one operation and each of the at least one term are assigned to a node in the abstract syntax tree, and further wherein each of the at least one term for each of the at least one operation is assigned to a child node with respect to the node of the at least one operation; converting the abstract syntax tree to an equivalent flattened columnar data query; generating a final query from the flattened columnar data query, wherein the final query comprises a tree structure equivalent to the flattened columnar data query; executing the final query at a compute cluster to produce a result; and outputting the result to an output location.
2. The method of claim 1 , wherein the inverted index query comprises a JavaScript Object Notation (JSON) query.
3. The method of claim 2, wherein the flattened columnar data query comprises a structured query language (SQL) query.
4. The method of claim 3, further comprising the step of, before executing the final query at the compute cluster, estimating a required size for the compute cluster.
5. The method of claim 4, wherein the step of converting the abstract syntax tree to an equivalent flattened columnar data query comprises the step of converting each node in the abstract syntax tree to an operator or a function.
6. The method of claim 5, wherein the operator or function is an SQL operator or an SQL function.
7. The method of claim 6, wherein the step of converting the abstract syntax tree to an equivalent flattened columnar data query uses a conversion table between JSON elements and SQL elements.
8. The method of claim 7, further comprising the step of generating a node list from the flattened columnar data query.
9. The method of claim 8, wherein the step of generating a node list from the flattened columnar data query comprises a post-order tree traversal.
10. The method of claim 9, wherein the step of generating a final query from the flattened columnar data query is performed using a recursive function.
11 . A system comprising at least one processor and at least one memory, the at least one memory in communication with the at least one processor and comprising an instruction set, the instruction set configured to cause the processor to perform operations comprising: receiving an inverted index query, wherein the inverted index query comprises at least one operation, and wherein the at least one operation comprises at least one term; converting the inverted index query to an abstract syntax tree, wherein each of the at least one operation and each of the at least one term are assigned to a node in the abstract syntax tree, and further wherein each of the at least one term for each of the at least one operation is assigned to a child node with respect to the node of the at least one operation; converting the abstract syntax tree to an equivalent flattened columnar data query; generating a final query from the flattened columnar data query, wherein the final query comprises a tree structure equivalent to the flattened columnar data query; executing the final query at a compute cluster to produce a result; and outputting the result to an output location.
12. The system of claim 11 , wherein the inverted index query comprises a JavaScript Object Notation (JSON) query.
13. The system of claim 12, wherein the flattened columnar data query comprises a structured query language (SQL) query.
14. The system of claim 13, wherein the instruction set is further configured to cause the processor to, before executing the final query at the compute cluster, estimating a required size for the compute cluster.
15. The system of claim 14, wherein the instruction set is further configured to cause the processor to convert each node in the abstract syntax tree to an operator or a function.
16. The system of claim 15, wherein the operator or function is an SQL operator or an SQL function.
17. The system of claim 16, wherein the instruction set is further configured to cause the processor to use a conversion table between JSON elements and SQL elements to convert each node in the abstract syntax tree to an operator or a function.
18. The system of claim 17, wherein the instruction set is further configured to cause the processor to generate a node list from the flattened columnar data query.
19. The system of claim 18, wherein the instruction set is further configured to cause the processor to perform a post-order tree traversal.
20. The system of claim 19, wherein the instruction set is further configured to cause the processor use a recursive function while generating a final query from the flattened columnar data query.
PCT/US2023/014141 2022-02-28 2023-02-28 Query splitter for an inverted index datastore WO2023164294A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263314704P 2022-02-28 2022-02-28
US63/314,704 2022-02-28

Publications (1)

Publication Number Publication Date
WO2023164294A1 true WO2023164294A1 (en) 2023-08-31

Family

ID=87766721

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/014141 WO2023164294A1 (en) 2022-02-28 2023-02-28 Query splitter for an inverted index datastore

Country Status (1)

Country Link
WO (1) WO2023164294A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064466A1 (en) * 2002-09-27 2004-04-01 Oracle International Corporation Techniques for rewriting XML queries directed to relational database constructs
US20050114314A1 (en) * 2003-11-26 2005-05-26 International Business Machines Corporation Index structure for supporting structural XML queries
US20050289125A1 (en) * 2004-06-23 2005-12-29 Oracle International Corporation Efficient evaluation of queries using translation
US20060224576A1 (en) * 2005-04-04 2006-10-05 Oracle International Corporation Effectively and efficiently supporting XML sequence type and XQuery sequence natively in a SQL system
US20100088665A1 (en) * 2008-10-03 2010-04-08 Microsoft Corporation Tree-based directed graph programming structures for a declarative programming language

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064466A1 (en) * 2002-09-27 2004-04-01 Oracle International Corporation Techniques for rewriting XML queries directed to relational database constructs
US20050114314A1 (en) * 2003-11-26 2005-05-26 International Business Machines Corporation Index structure for supporting structural XML queries
US20050289125A1 (en) * 2004-06-23 2005-12-29 Oracle International Corporation Efficient evaluation of queries using translation
US20060224576A1 (en) * 2005-04-04 2006-10-05 Oracle International Corporation Effectively and efficiently supporting XML sequence type and XQuery sequence natively in a SQL system
US20100088665A1 (en) * 2008-10-03 2010-04-08 Microsoft Corporation Tree-based directed graph programming structures for a declarative programming language

Similar Documents

Publication Publication Date Title
US11645471B1 (en) Determining a relationship recommendation for a natural language request
US20200183932A1 (en) Optimizing write operations in object schema-based application programming interfaces (apis)
JP6617117B2 (en) Scalable analysis platform for semi-structured data
US11068439B2 (en) Unsupervised method for enriching RDF data sources from denormalized data
US11670288B1 (en) Generating predicted follow-on requests to a natural language request received by a natural language processing system
US10116725B2 (en) Processing data retrieval requests in a graph projection of an application programming interfaces (API)
US11288319B1 (en) Generating trending natural language request recommendations
US11886430B1 (en) Intent-based natural language processing system
US11475053B1 (en) Providing completion recommendations for a partial natural language request received by a natural language processing system
US11030242B1 (en) Indexing and querying semi-structured documents using a key-value store
US10102246B2 (en) Natural language consumer segmentation
US10599654B2 (en) Method and system for determining unique events from a stream of events
US11762775B2 (en) Systems and methods for implementing overlapping data caching for object application program interfaces
JP5677319B2 (en) Web-based diagram visual extensibility
CN111708805A (en) Data query method and device, electronic equipment and storage medium
US11726999B1 (en) Obtaining inferences to perform access requests at a non-relational database system
CN108536728A (en) A kind of data query method and apparatus
Parmar et al. MongoDB as an efficient graph database: An application of document oriented NOSQL database
Ribeiro et al. Transparent Interoperability Middleware between Data and Service Cloud Layers.
US10657188B2 (en) Representational state transfer resource collection management
US11609924B2 (en) Database query execution on multiple databases
US10885157B2 (en) Determining a database signature
CN115292313A (en) Pseudo-column implementation method and device, electronic equipment and storage medium
WO2023164294A1 (en) Query splitter for an inverted index datastore
CN115905274A (en) Data processing method and device, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23760789

Country of ref document: EP

Kind code of ref document: A1