WO2023211813A1 - Génération de script dynamique pour exécution et agrégation d'interrogation distribuées - Google Patents

Génération de script dynamique pour exécution et agrégation d'interrogation distribuées Download PDF

Info

Publication number
WO2023211813A1
WO2023211813A1 PCT/US2023/019590 US2023019590W WO2023211813A1 WO 2023211813 A1 WO2023211813 A1 WO 2023211813A1 US 2023019590 W US2023019590 W US 2023019590W WO 2023211813 A1 WO2023211813 A1 WO 2023211813A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
files
computer
implemented method
data
Prior art date
Application number
PCT/US2023/019590
Other languages
English (en)
Inventor
Luke Higgins
Robert Bruno
Original Assignee
Morgan Stanley Services Group Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/727,764 external-priority patent/US11645231B1/en
Priority claimed from US17/727,767 external-priority patent/US11520739B1/en
Priority claimed from US17/727,765 external-priority patent/US11500889B1/en
Application filed by Morgan Stanley Services Group Inc. filed Critical Morgan Stanley Services Group Inc.
Publication of WO2023211813A1 publication Critical patent/WO2023211813A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/14Details of searching files based on file metadata

Definitions

  • This disclosure relates to systems and methods for generating, structuring, and executing queries on a data store, and more specifically, to systems and methods for preprocessing a data query to improve efficiency, distributing the data query to a cloud of processors, and post-processing the results to respond to the query.
  • SQL Structured Querying Language
  • RDBMS relational database management systems
  • return a single record such as a SELECT on a field that acts as a unique key
  • return a multitude of records such as a SELECT on a non-unique field, or on a JOIN of multiple tables
  • return single values derived from a data set such as a COUNT of all records that match a criterion, or a SUM of all values in a particular column for records that match the criterion.
  • a system for receiving and indexing a plurality of files for later querying is disclosed.
  • the system comprises one or more data stores and non-transitory memory comprising instructions.
  • the instructions When executed, the instructions cause processors to receive a plurality of files to be stored in the one or more data stores and store each file from the plurality of files in a bucket that exclusively stores files received during a predetermined window of time.
  • the system periodically indexes a plurality of keys from the plurality of files by establishing a two-tiered index for each key, comprising an index dictionary and a Bloom filter, such that for every value of every key from the plurality of keys, there exists an entry in the index dictionary for that key indicating which files of the of plurality of files have that value for that key, and such that the Bloom filter for that key has that value inserted.
  • the Bloom filter allows a query to be received comprising a Boolean OR/AND/NOT of a value for a key, and if the value has not been inserted in the Bloom filter for that key, the query fails before downloading any of the plurality of files from the one or more data stores; for more efficient time-based file searching; and for the created index dictionary to be stored in the one or more data stores in a same format as the plurality of files, and even generated by the same query process that the indexing otherwise facilitates.
  • a system for dynamically generating scripts to be executed during a query of a data store comprises a server comprising one or more processors and non-transitory memory comprising instructions.
  • the instructions When executed by the processors, the instructions cause the processors to receive a query comprising a time window of files to be searched, key values to search on, and one or both of a filter selecting a subset of the files to be searched and an aggregation of data from all of the files or from all of the files that are filtered.
  • the system dynamically generates a script based on contents of the query, optimizing the script from a template to build up only features necessary to satisfy the query; and distributes the generated script horizontally to a plurality of computing devices that will execute the query by calling the generated script on each of the files to be searched.
  • Various optimizations are provided in the dynamically generated script.
  • a system for distributing a query and aggregating results of the distributed query comprises one or more data stores, a distribution server comprising one or more processors, a plurality of computing devices, each computing device being associated with at least one data store of the one or more data stores, and non-transitory memory comprising instructions.
  • the instructions When executed, the instructions cause processors to receive a dynamically generated script representing a query of files in the one or more data stores and transmit the dynamically generated script to each computing device of the plurality of computing devices.
  • the system directs each computing device of the plurality of computing devices to perform the query on a subset of files in the one or more data stores and receives from each computing device a subset of results based on execution of the dynamically generated script on the subset of files.
  • the system may aggregate the subsets of results into a final result that represents a set of records that satisfy a query that was used to dynamically generate the script or that contain statistical summary information on records based on a query that was used to dynamically generate the script; and transmits the final result for storage and provision to a user who wrote the query.
  • FIG. 1 illustrates, in simplified form, a system of computing devices used to receive a query, pre-process the query, distribute the query, aggregate the results of the query, and return the results to a human user;
  • FIG. 2 depicts, in simplified form, a method of indexing and organizing files to facilitate future queries
  • FIG. 3 depicts, in simplified form, a method of dynamically generating a script to be executed by each querying device
  • FIG. 4 illustrates, in simplified form, a method of “enriching” queries and query results by associating a field’s data in a machine-readable format with a human-preferred means of expressing the same data;
  • FIG. 5 illustrates, in simplified form, a method of distributing a script representing a query to a cluster of computing devices and each computing device performing the query;
  • FIG. 6 depicts, in simplified form, a method for processing the separate results of the distributed query to assemble them or otherwise prepare them to be forwarded to the end user;
  • FIG. 7 depicts, in simplified form, additional steps that are performed based on any options specified by the requesting user and in order to prepare the system to execute a new query;
  • FIG. 8 is a high-level block diagram of a representative computing device that may be utilized to implement various features and processes described herein.
  • a system is provided to facilitate queries across staggering quantities of data in a filesystem-like construct.
  • the data to be queried is preferably stored in a NOSQL format such as Apache Parquet, JavaScript Object Notation (JSON), or another columnar format that does not require a formal relational database construct with tables or external schema, and can be stored in a generic “data lake” without concern for organization, such as the storage services that might be provided by Amazon S3 or other cloud-based storage providers.
  • NOSQL format such as Apache Parquet, JavaScript Object Notation (JSON), or another columnar format that does not require a formal relational database construct with tables or external schema
  • JSON JavaScript Object Notation
  • a generic “data lake” without concern for organization, such as the storage services that might be provided by Amazon S3 or other cloud-based storage providers.
  • the system allows an end user to create queries including SQL-like functions, such as select for a value in a column, select by regular expression being satisfied by a value in a column, group by value in a column, group by regex being satisfied a value in a column, joining on a column across two tabular formats, count of records having a value in a column, set of all unique values in a column, count of unique values present in a column, count of each value in a column, sum of values in a column, concatenation of fields with a delimiter, ratio of one column or sum of columns to another column or sum of columns, least value present in a column, greatest value present in a column, applying a filter to records that would otherwise be returned by a select, quartile statistics of the values in a column, median value in a column, mean value in a column, standard deviation of values in a column, and other statistics or manipulations of results.
  • SQL-like functions such as select for a value in a column, select by regular expression being satisfied by
  • End users need not be familiar with any particular querying language such as SQL, nor with programming in general, but instead might need only understand the JSON format and fields of a custom querying language to be use within said format, or might even construct a query by using a web-based graphical user interface (GUI).
  • GUI graphical user interface
  • enrichment_source “q_ips”, “enrichment_key” : “source_ip”, “enrichment_path” : “reference/ q_latest_dagg . j son” ⁇ ,
  • an optimized script is dynamically generated in Python (or a similar scripting language) based on the user query and on the features of the custom querying language that have been invoked by that particular query.
  • Python or a similar scripting language
  • the generated script is horizontally distributed to a cluster of computing devices, and executed by each in parallel.
  • trillions of records containing petabytes of data can be queried in a matter of minutes and useful data is provided to the user potentially while the user waits during a same session, instead of being processed and returned to the user hours or days later.
  • the results might be immediately provided to the end user via the GUI, might be transformed into a Parquet or JSON format and provided to the end user to download, might be routed to a separate destination for further processing, or might be saved in a format that allows recursive or iterative queries to be performed on the results of previous queries.
  • FIG. 1 illustrates, in simplified form, a system of computing devices used to receive a query, pre- process the query, distribute the query, aggregate the results of the query, and return the results to a human user.
  • a user interface on a client computer 100 may be used to submit a query 105 via an API.
  • the query 105 may, in a preferred embodiment, be typed into a text box in a web-based GUI and submitted by clicking a button that calls a JavaScript function to submit the query via the API.
  • more specialized software may take advantage of the API instead of a web browser; software may incorporate a dynamic linked library (DLL) that extends the API into a same language the software is written in; or the query definition may be submitted by other means, such as saving the query in a text file and uploading the text file.
  • DLL dynamic linked library
  • the query 105 is preferably JSON based, with a number of required and optional fields.
  • Required fields in a preferable embodiment include a beginning timestamp and an ending timestamp (to take advantage of time-based indexing and reduce the scope of a search), the columns to be included in the output of the query, the type of log files to limit the search to, the output format (single file or separate files), and either a key describing how to group data being obtained (for example, group by a particular field value or group all records matching a particular regex) or a filter (in order to retrieve full records matching a criterion instead of only obtaining some data from them). If a key has been provided, a filter may additionally be provided anyway to further refine the query.
  • the following query could be used to search for all SMTP records of emails sent by an SMTP server on the date December 1, 2021, identify the sender email address of each email via a regex, and count the number of emails sent by each email address, using that address as a key for grouping the output:
  • a web server 110 receives the query 105 via the API and may perform initial validation or optimization tasks on the query 105.
  • the data may be initially validated to confirm that it is in the JSON format and can be parsed, that it contains all required field names, that all field names are correct and contain no typographical errors, and so on. If there are errors in the structure of the query itself, they may be recoverable by replacing one form of punctuation with another or one misspelled word with a known querying keyword that appears to have been meant.
  • the server 110 may also check user entitlements sent along with the query 105 to confirm that the user is permitted to access the data that would be returned by the query, and prevent the query from being performed if permission is absent or remove prohibited columns from the query while retaining permitted columns. If a required field is absent, it may nonetheless be possible to perform the query by including a default value for that field, such as setting the “end” to the present and the “beginning” to 24 hours previously, if beginning and ending timestamps are omitted from the query. [0028] Any of the above checks or failsafes may result in a query that is somewhat changed from the initially submitted query, resulting in a validated query 115. In other embodiments, no changes may be performed, and the original query 105 may continue to be used without modification.
  • the validated query 115 or original query 105 is forwarded from the web server 110 to a queuing server 120 that tracks queries which are to be performed when a cluster of computing devices 135, or even multiple such clusters operating in parallel, that will actually perform the query are still busy with a previously submitted query.
  • the Ray clustering solution is used to control each of the computing devices in unison and distribute tasks among them.
  • a Python generation server 125 While waiting for the cluster of computing devices 135 to become available, a Python generation server 125 generates a custom Python script 130 to be used specifically for the validated query 115 or original query 105, based on the contents of that query, according to a method described further below in relation to FIG. 3.
  • languages other than Python might be used, such as Perl, PHP, Ruby, or similarly interpreted scripting languages.
  • a compiled language might be used to create an executable or interpretable program, such as C, C++, Java, etc.
  • the Python script 130 is forwarded to each of them, according to a method described further below in relation to FIG. 5.
  • Each computing device 135 has access to at least a portion of the data storage 140.
  • Amazon S3 acts in this capacity and provides the storage.
  • the storage 140 is generally unstructured and can act as a “data lake” with various raw files 145 having different internal formats and significances, captured or produced by different sensors or software, indiscriminately inserted into the storage. Some indexing of the inserted files is performed, according to a method described further below in relation to FIG. 2.
  • each of the files 145 is in the Apache Parquet format
  • the files could equally be stored in JSON format, or in another format that organizes data into one or more records, each with defined fields of data, such as a Comma Separated Value (CSV) database, some forms of Extensible Markup Language (XML), or other simple tabular or markup data formats.
  • CSV Comma Separated Value
  • XML Extensible Markup Language
  • One advantage of allowing JSON input as an option is that the ultimate output of each query is by default in the JSON format, and this allows for recursive querying, a query that operates solely on the results of a previous query.
  • each computing device 135 executes its instance of the Python script 130, it generates results 150 based on its portion of the storage 140.
  • the portion of storage may be defined by limited access — for example, if the storage 140 is distributed across multiple locations and a computing device 135 only has access to one of these locations.
  • the portion may instead, in a preferred embodiment, be the result of a division of a keyspace between the devices 135.
  • a key value in data may be hashed and assigned to a particular computing device 135 for processing based on the hash falling within a certain range assigned to that particular computing device 135.
  • the outputs 150 from these scripts 130 may optionally be forwarded to a re-aggregation server or cluster of servers 155 to be combined and/or processed into a final result 160, according to a process described further below in relation to FIG. 6.
  • a first computing device searching a first region of storage may have discovered two emails sent by johndoe@example.com, and a second computing device searching a second region of storage may have discovered three more email sent by johndoe@example.com.
  • the re-aggregation server or cluster of servers 155 would need to track counts for each key value in each result 150 and use them to generate results 160 showing that the email address had sent five emails during the query time window.
  • the re-aggregation server may simply generate a final result 160 that is the union or concatenation of all results 150 that were provided to it. In a preferred embodiment, however, no re-aggregation server is necessary. Instead, results 150 are stored in the storage 140 and made available for an end user to download and process in any order at the end user’s convenience. This asynchronous and divided result may often be necessary when a query may take hours to be performed and return such a large dataset that the entire result cannot be stored in a single device’s volatile memory at one time.
  • the individual results 150 or final results 160 are either directly transmitted to the request user’s computer 100 or, preferably, are sent to the storage 140 to be obtained by the user at a later time or interacted with in other ways, such as by a query that acts exclusively on these results to further aggregate or generate statistics related to the initial query.
  • the final result 160 is stored in the JSON format, though in other embodiments, other formats might be used, such as Parquet, XML, plaintext, or other formats suitable to store and present the requested information.
  • the results format is JSON or a similar compatible format to enable a recursive process of executing a first query, then executing a second query operating on the output generated by the first query, further aggregating or refining results.
  • each system or device from among the web server 110, the queueing server 120, the Python generation server 125, and the re-aggregation server 155 may in fact be a cluster of computing devices sharing functionality for concurrent processing.
  • these various computing elements are described as if they are one computing device or cluster each, a cloud-based solution with multiple access points to similar systems that synchronize their data and are all available as backups to one another may be preferable in some embodiments to a unique set of computing devices all stored at one location.
  • the specific number of computing devices and whether communication between them is network transmission between separate computing devices or accessing a local memory of a single computing device is not so important as the functionality that each part has in the overall scheme. What does remain of importance is that there are multiple computing devices 135 in a cluster to distribute the query among them and re-aggregate the results after each device has produced its individual result.
  • indexing of all files is not necessary for a query to be performed, lack of an index will require every file in the storage 140 to be independently checked for possible satisfaction of selection and filtering criteria provided by the user.
  • FIG. 2 depicts, in simplified form, a method of indexing and organizing files to facilitate future queries.
  • the storage 140 is organized into a series of “buckets” such that all files received during a given hour of a given day go into the same bucket, to facilitate time-based queries. Additional sub-buckets may be provided within each bucket based on file type or log type, source of the file, a value of a commonly accessed column, or other criteria. Further, buckets based on a particular unit or interval of time may be further sub-divided into more granular units of time, such as sub-buckets for each hour within a day bucket, or sub- buckets for each minute in an hour bucket. Each file stored is given a unique identifier to be used within the index.
  • Step 200 the system checks whether enough time has elapsed for a new bucket to be created (Step 205). As previously mentioned, this is, in a preferred embodiment, after one hour has passed since the last new time-based bucket was added, though shorter or longer lengths of time may be used for each bucket window if there are higher or lower volumes of data to be indexed. If the time has elapsed, a new bucket is created (Step 210).
  • the file is inserted into the most recently created bucket and the system may also determine any relevant sub-bucket (Step 215) as described above, based on file type or log type, source of the file, a value of a commonly accessed column, or other criteria.
  • indexes may be regularly generated by a distributed query, according to the methods described above and below. For example, in order to index every email sent by an SMTP server by sender or by recipient, a query to group every email by sender or recipient and output a column that is a concatenation of every unique ID of every email that was within that group would result in an index that could be consulted in future queries by sender or recipient.
  • the generated indexes are themselves stored in the same storage 140 as the underlying data for the sake of simplicity. In other embodiments, it may be desirable to have a specialized fast retrieval storage for the index to further improve efficiency.
  • these generated indexes are generated once per day, such that the software checks whether the current indexes are more than 24 hours old (Step 220) and if so, begins to generate the new indexes.
  • the preferred interval of one day corresponds to the preferred interval bucket size, so that each newly generated index is relevant to the contents of the newest bucket. If a different bucket size, such as an hour or a week is used, a different indexing frequency may be preferred as well.
  • each value of each key column (or a numerical offset that can be used to look up the value in an array) is inserted into the dictionary-based index, that value or offset is also inserted into a Bloom filter for that key (Step 225) that tracks whether a given value for a key has been indexed.
  • a Bloom filter always has a small possibility of returning a false positive and reporting that a value has been inserted when it actually was not, the inability to have a false negative and the fast lookup time on a Bloom filter allows for a search on for a key value that is not present to short circuit even faster and fail even before consulting the dictionary table of key values.
  • each index, both the dictionary and Bloom filter portions is horizontally distributed (Step 230) to ensure that each computing device 135 has access to the index.
  • Step 205 the time-based checks whether to add new buckets to the storage (Step 205) and whether to generate indexes (Step 220) are conceptually described as if they are triggers called when a file is received, various embodiments might schedule a cron job or other process to perform these tasks at the intended time independent of insertions into storage.
  • the system can begin performing queries for records or data aggregations of that data.
  • FIG. 3 depicts, in simplified form, a method of dynamically generating a script to be executed by each querying device.
  • Step 300 the query is checked (Step 300) for any contents that may interfere with insertion into a Python template (or template written in another scripting language), such as quotes or other special characters that may need to be escaped with backslashes or removed; use of Python keywords that may need to be quoted, escaped, or removed; or any other feature that may cause unexpected behavior when inserted into a script that will be interpreted and executed.
  • a Python template or template written in another scripting language
  • the required time window (both beginning and end) is used to select a set of buckets to which the search will be limited when the script is executed (Step 305). Because it is possible that files or logs gathered by a sensor may have been delayed somewhat, additional buckets may be added despite the intention that they represent windows of time that do not overlap with the query window. For example, if logs are to be searched that arrived between midnight and 1 a.m., the additional buckets for 1 a.m. to 2 a.m. and 2 a.m. to 3 a.m. might be included in case data that was originally generated between midnight and 1 a.m. did not arrive for ingestion until after 1 a.m. Depending on the source of data, any intermediate processing it may have undergone, and especially any human element delaying the transmission of data, the search window may be arbitrarily expanded further to reduce the risk of failing to find desired data.
  • the script is built up (Step 310) to include only features that will be necessary to accomplish the task. For example, a generic, pre-built solution would likely require a tree of if/else statements or case switches to test whether an output field is to be a sum (and if so, perform the addition), an average (and if so, perform the addition and division), etc.
  • a dynamically generated script can have code to sum or average if and only if those features are used in the query, without any conditional control structures being necessary. When said code is being executed millions or billions of times during performance of the query, the lack of unnecessary conditional structures leads to a marked improvement in performance.
  • a key might be defined in the query as, rather than the value of a particular column, a Python function (or function in another programming language) called upon the value of a column, allowing for even greater sophistication in a query.
  • a function might include a decryption function, allowing the raw data to be encrypted before storage and only decrypted at the moment the file is retrieved from storage and checked to see whether it satisfies the query.
  • any arbitrary function might be allowed, in a preferred embodiment, a number of capabilities often enabled by programming languages (such as accessing a filesystem or input/output devices) may be prevented, and instead have functions limited to those that only operate on variables in memory.
  • the resulting Python file is preferably saved to storage 140 (Step 315) to await the computing devices 135 becoming available to execute the script on the various files in storage.
  • Other optimizations such as distribution of specific functions from the Python file to specific devices in a Ray cluster, may also be performed.
  • the generated script contains code to attempt to take advantage of Boolean logic to short-circuit when a query is definitely satisfied or will definitely not be satisfied. For example, if there are two filters with a Boolean OR between them, the Python script will be written such that the first one to be checked is a key that is indexed, and if it is found to be present, will terminate and return the record without checking the other. If there are two filters with a Boolean AND between them, the Bloom filter part of the index may be checked first for a value of an indexed key, and if the value is not present in the Bloom filter, the search can immediately be terminated. Moreover, if there is an AND aspect to a query and any of the keys are not indexed, use of the index may be abandoned altogether, as it will be ultimately necessary to download and check each file sequentially, regardless.
  • filters When multiple filters are to be used, they are preferably expressed in disjunctive normal form, /. ⁇ ., a set of sub-filters with an implicit OR disjunction between them. Thus, if it is desired to obtain all records for which “A” and “B” are columns with a true value, one would include the single sub-query as follows:
  • the generated code may contain a variety of checks that will be performed at runtime when the query is executed, in order to optimize performance based on the results of the initial query. For example, if a query only requests a count of records satisfying a filter, the index may be consulted to count the records in the dictionary and satisfy the query without downloading any files. If a query is for a time window that is so recent the data has not yet been indexed, or if the query does not operate on any indexed fields, all use of the indexes may be abandoned and each file sequentially checked for possible satisfaction of the query.
  • the time window overlaps both a period when indexing was completed and a period when indexing was not completed — for example, a window for the previous seven days, but indexes for the present day have not been created and only exist for data between seven days and one day old.
  • a hybrid approach will be followed where the indexes are consulted in searching through the older data, but every file among the newer data is sequentially checked.
  • FIG. 4 illustrates, in simplified form, a method of “enriching” queries and query results by associating a field’s data in a machine-readable format with a human-preferred means of expressing the same data.
  • Enrichment is achieved by regularly generating a table of translations from a value to one or more associated values and storing them in the storage 140 to be accessed by the computing devices 135 or the re-aggregation server 155 at the time of executing the query.
  • Enrichment JSONs are generated via a daily cron job or similar scheduled software process (Step 400).
  • the JSON (or other file format) is propagated to the storage such that it is available to each computing device (Step 405), enabling those computing devices to perform pseudo-joins without communication with any other external source or database.
  • the query may have key, filtering, or output fields include one or more predefined enrichment fields (Step 410).
  • the user may use a UI to select an enrichment field from a list of available enrichment fields, and doesn't need to know the “join” syntax that would be required to achieve similar functionality in an SQL database system with multiple interrelated tables.
  • the above JSON might be used to search for files having a hostname of “grn” and aggregate their values, even though the underlying files actually only contain IP addresses, and not hostnames.
  • the data from the enrichment JSON (associating a particular IP address with a particular hostname) is automatically joined into a raw data structure with fields loaded from a Parquet file, even before the Python script begins to parse it (Step 415). As a result, the script is completely unaware that a given file did not contain the enriched information in its raw format.
  • FIG. 5 illustrates, in simplified form, a method of distributing a script representing a query to a cluster of computing devices and each computing device performing the query.
  • Step 500 the cluster of computing devices 135 is locked (Step 500) to ensure that they only work on one query at a time.
  • Step 505 The next pending query is selected from the query server (Step 505) and its associated dynamically generated script is retrieved (Step 510)
  • the cluster of computing devices 135 contains 256 actual machines or 256 virtual machines distributed among a number of actual machines capable of multi-core simultaneous processing.
  • Files from the storage 140 are preferably assigned (Step 515) to each of the computing devices 135 in a “round robin” style and/or by chunking the set of files into portions of equal size and assigning them sequentially to each of the computing devices, so that each computing device 135 receives an approximately equal 1/256 share of the files.
  • hashing files such that, if the first byte of the hash of the file represents the number N from 0 to 255, the A' /th machine or processor will retrieve that file and determine whether it satisfies the query. If fewer than 256 processors are capable of acting, files may be assigned by the value of the hash mod the number of processors; if more than 256 processors are desired, the files could be assigned based on the first 9 bits, 10 bits, etc., of the hash instead of only the first 8 bits. So long as an appropriate hashing function is used, files should be evenly distributed between the set of processors.
  • Step 520 Each machine, for each file, executes the dynamic script (Step 520) upon that file, and tracks the output of the script for each file. If the query is returning summary or statistical data rather than records, a dictionary is created (Step 525) and used to create an entry for each value for each key in the query and updated as each file is parsed, until all files have been parsed (Step 530) [0080] Post Processing and Aggregation
  • FIG. 6 depicts, in simplified form, a method for processing the separate results of the distributed query to assemble them or otherwise prepare them to be forwarded to the end user.
  • every query will either request in response a union set of all records that satisfy the query, or instead request derived or assembled statistical data on each value for one or more key columns.
  • each of the computing devices 135 may each be provided to a re-aggregation server or cluster of servers 155 (Step 600) for combination into final results 160.
  • a re-aggregation server or cluster of servers 155 Step 600
  • the partitioning may be performed by hashing the key value for each result and forwarding it to the TVth server of the cluster, where N is the hash, mod the number of servers.
  • the combination step acts essentially as a map reduce, applying a particular function to each of the sets of results, sorted by the key value. Alternatively, there may be no further re-aggregation required by the query, and the intermediate results 150 may be provided individually for storage or use by the requesting user.
  • the output options include enrichments that have not yet been applied to the data
  • enrichments are applied (Step 605) to augment query results with additional data from external sources. This can allow even more sophisticated queries that operate on the enriched data.
  • a significant power of data enrichment is that the enriched values can themselves be values or keys in other tabular data or unstructured data.
  • a second data source might be organized by hostname and contain additional data that can now be associated with each IP address during aggregation.
  • any statistics are calculated (Step 615) or the full set of records are assembled (Step 620) from the intermediate results 150.
  • a tracking variable may be used while iterating through each of the intermediate results 150.
  • more exotic data calculations might be performed than merely using a variable to track a sum, minimum, maximum, etc. as the results are iterated.
  • a query might request a classification of data matching filters into five classifications, without knowing in advance what the nature of the output would be.
  • the individual partial results 150 or final result 160 is stored (Step 625) in the storage 140 so that it can be accessed by the end user or by future scripts executing a recursive query upon query results.
  • the aggregation process uses fixed software rather than dynamically generated software as in the querying step. Efficiency gains from dynamic software at this step are somewhat diminished because the scope of the data to process has already been reduced by orders of magnitude after the distributed queries have been performed. Nevertheless, in some embodiments, it might be preferable to dynamically generate code to perform the aggregation of data results when the functions used or the type of data being aggregated could benefit from optimized code that lacks if statements or loops that will never be necessary.
  • FIG. 7 depicts, in simplified form, additional steps that are performed based on any options specified by the requesting user and in order to prepare the system to execute a new query.
  • Step 700 various statistics may be gathered (Step 700) concerning the execution of the query, such as total time, the number of files downloaded, number of records provided, and so on. These statistics may be saved for future analysis or provided to the end user along with the query results.
  • the lock on the Ray cluster is removed to enable the cluster to process a new query (Step 705).
  • the queueing server is notified (Step 710) that the lock has been removed and that code related to a new query can be sent to the Ray cluster.
  • the web server 110 is also notified that the process has been completed, to allow any web-based UI to indicate that the query is complete (Step 715).
  • the end user or the end user’s software may be responsible for periodically checking the storage to determine if new files for storing results have been created and have been unlocked after writing is complete.
  • any UI may also be updated to include all or a portion of the results (Step 720). For example, if a query only requests a count or other single statistic of all records that satisfy a query, that statistic may simply be displayed on the screen. If a query requests counts to be aggregated for each value of a particular key column, the UI may be configured to display the first five, or first one hundred, etc. key values, ordered in descending order by count numbers. If the output has been saved to an external file to which the user has access, a hyperlink or other information needed to obtain the file may be displayed via the UI.
  • FIG. 1 depicts a preferred configuration of computing devices and software modules to accomplish the software-implemented methods described above, those methods do not inherently rely on the use of any particular specialized computing devices, as opposed to standard desktop computers and/or web servers.
  • FIG. 8 describes various enabling devices and technologies related to the physical components and architectures described above.
  • FIG. 8 is a high-level block diagram of a representative computing device that may be utilized to implement various features and processes described herein, for example, the functionality of the user’s computing device 100, the web server 110, the queueing server 120, the Python generation server 125, the re-aggregation server 155, each distributed computing device 135, or any other computing device described.
  • the computing device may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system.
  • program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types.
  • the computing device is illustrated in the form of a special purpose computer system.
  • the components of the computing device may include (but are not limited to) one or more processors or processing units 800, a system memory 810, and a bus 815 that couples various system components including memory 810 to processor 800.
  • Bus 815 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
  • bus architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
  • Processing unit(s) 800 may execute computer programs stored in memory 810. Any suitable programming language can be used to implement the routines of particular embodiments including C, C++, Java, assembly language, etc. Different programming techniques can be employed such as procedural or object oriented. The routines can execute on a single computing device or multiple computing devices. Further, multiple processors 800 may be used.
  • the computing device typically includes a variety of computer system readable media. Such media may be any available media that is accessible by the computing device, and it includes both volatile and non-volatile media, removable and non-removable media.
  • System memory 810 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 820 and/or cache memory 830.
  • the computing device may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 840 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically referred to as a “hard drive”).
  • a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”)
  • an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media
  • each can be connected to bus 815 by one or more data media interfaces.
  • memory 810 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments described in this disclosure.
  • Program/utility 850 having a set (at least one) of program modules 855, may be stored in memory 810 by way of example, and not limitation, as well as an operating system, one or more application software, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment.
  • the computing device may also communicate with one or more external devices 870 such as a keyboard, a pointing device, a display, etc.; one or more devices that enable a user to interact with the computing device; and/or any devices (e.g., network card, modem, etc.) that enable the computing device to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interface(s) 860.
  • external devices 870 such as a keyboard, a pointing device, a display, etc.
  • any devices e.g., network card, modem, etc.
  • I/O Input/Output
  • the computing device can communicate with one or more networks, such as a local area network (LAN), a general wide area network (WAN) and/or a public network (e.g., the Internet) via network adaptor 880.
  • network adaptor 880 communicates with other components of the computing device via bus 815.
  • bus 815 It should be understood that although not shown, other hardware and/or software components could be used in conjunction with the computing device. Examples include (but are not limited to) microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
  • the present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • RAM random access memory
  • ROM read-only memory
  • EPROM or Flash memory erasable programmable read-only memory
  • SRAM static random access memory
  • CD-ROM compact disc read-only memory
  • DVD digital versatile disk
  • memory stick a floppy disk
  • a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may use copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, statesetting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user’s computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • the flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

L'invention concerne des procédés et des systèmes mis en oeuvre par ordinateur pour recevoir et indexer une pluralité de fichiers pour une interrogation ultérieure, pour générer dynamiquement des scripts à exécuter pendant une interrogation d'un magasin de données, et pour distribuer horizontalement une interrogation et agréger des résultats de l'interrogation distribuée.
PCT/US2023/019590 2022-04-24 2023-04-24 Génération de script dynamique pour exécution et agrégation d'interrogation distribuées WO2023211813A1 (fr)

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
US17/727,764 US11645231B1 (en) 2022-04-24 2022-04-24 Data indexing for distributed query execution and aggregation
US17/727,767 US11520739B1 (en) 2022-04-24 2022-04-24 Distributed query execution and aggregation
US17/727,765 US11500889B1 (en) 2022-04-24 2022-04-24 Dynamic script generation for distributed query execution and aggregation
US17/727,767 2022-04-24
US17/727,764 2022-04-24
US17/727,765 2022-04-24
US18/138,011 US11954068B2 (en) 2022-04-24 2023-04-21 Data indexing for distributed query execution and aggregation
US18/138,013 2023-04-21
US18/138,014 2023-04-21
US18/138,013 US20230342332A1 (en) 2022-04-24 2023-04-21 Dynamic script generation for distributed query execution and aggregation
US18/138,011 2023-04-21
US18/138,014 US20230342333A1 (en) 2022-04-24 2023-04-21 Distributed query execution and aggregation

Publications (1)

Publication Number Publication Date
WO2023211813A1 true WO2023211813A1 (fr) 2023-11-02

Family

ID=88519507

Family Applications (3)

Application Number Title Priority Date Filing Date
PCT/US2023/019590 WO2023211813A1 (fr) 2022-04-24 2023-04-24 Génération de script dynamique pour exécution et agrégation d'interrogation distribuées
PCT/US2023/019587 WO2023211811A1 (fr) 2022-04-24 2023-04-24 Indexation de données pour exécution et agrégation d'interrogation distribuée
PCT/US2023/019593 WO2023211815A1 (fr) 2022-04-24 2023-04-24 Exécution et agrégation d'interrogation distribuée

Family Applications After (2)

Application Number Title Priority Date Filing Date
PCT/US2023/019587 WO2023211811A1 (fr) 2022-04-24 2023-04-24 Indexation de données pour exécution et agrégation d'interrogation distribuée
PCT/US2023/019593 WO2023211815A1 (fr) 2022-04-24 2023-04-24 Exécution et agrégation d'interrogation distribuée

Country Status (1)

Country Link
WO (3) WO2023211813A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034453A1 (en) * 2017-07-26 2019-01-31 Quantum Corporation Flexible synchronous file system replication
US20190294701A1 (en) * 2018-03-26 2019-09-26 International Business Machines Corporation Data replication in a distributed file system
US20200082015A1 (en) * 2018-09-07 2020-03-12 Gravwell, Inc. Unstructured data fusion by content-aware concurrent data processing pipeline
US20210133193A1 (en) * 2019-10-30 2021-05-06 Boray Data Technology Co. Ltd. Dynamic query optimization
CN114090671A (zh) * 2021-11-23 2022-02-25 深圳市元征科技股份有限公司 数据导入方法、装置、电子设备及存储介质
US11500889B1 (en) * 2022-04-24 2022-11-15 Morgan Stanley Services Group Inc. Dynamic script generation for distributed query execution and aggregation

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7769755B2 (en) * 2006-11-30 2010-08-03 Microsoft Corporation Efficient execution of aggregation queries
EP2490135A1 (fr) * 2011-02-21 2012-08-22 Amadeus S.A.S. Procédé et système pour mettre à disposition des données statistiques à partir d'un entrepôt de données
US8930691B2 (en) * 2011-08-16 2015-01-06 Microsoft Corporation Dynamic symmetric searchable encryption
US9507824B2 (en) * 2014-08-22 2016-11-29 Attivio Inc. Automated creation of join graphs for unrelated data sets among relational databases
US10102269B2 (en) * 2015-02-27 2018-10-16 Microsoft Technology Licensing, Llc Object query model for analytics data access
US10678791B2 (en) * 2015-10-15 2020-06-09 Oracle International Corporation Using shared dictionaries on join columns to improve performance of joins in relational databases
US10686867B2 (en) * 2017-05-12 2020-06-16 Guavus, Inc. Scaled in-order record input ingestion for file-based streams in multi-threaded environments
US10642994B1 (en) * 2017-05-24 2020-05-05 Amazon Technologies, Inc. Probabilistic data structures for concordance management
RU2720905C2 (ru) * 2018-09-17 2020-05-14 Общество С Ограниченной Ответственностью "Яндекс" Способ и система для расширения поисковых запросов с целью ранжирования результатов поиска
US11074196B1 (en) * 2018-09-28 2021-07-27 Splunk Inc. Evicting data associated with a data intake and query system from a local storage
US11520739B1 (en) * 2022-04-24 2022-12-06 Morgan Stanley Services Group Inc. Distributed query execution and aggregation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034453A1 (en) * 2017-07-26 2019-01-31 Quantum Corporation Flexible synchronous file system replication
US20190294701A1 (en) * 2018-03-26 2019-09-26 International Business Machines Corporation Data replication in a distributed file system
US20200082015A1 (en) * 2018-09-07 2020-03-12 Gravwell, Inc. Unstructured data fusion by content-aware concurrent data processing pipeline
US20210133193A1 (en) * 2019-10-30 2021-05-06 Boray Data Technology Co. Ltd. Dynamic query optimization
CN114090671A (zh) * 2021-11-23 2022-02-25 深圳市元征科技股份有限公司 数据导入方法、装置、电子设备及存储介质
US11500889B1 (en) * 2022-04-24 2022-11-15 Morgan Stanley Services Group Inc. Dynamic script generation for distributed query execution and aggregation

Also Published As

Publication number Publication date
WO2023211815A1 (fr) 2023-11-02
WO2023211811A1 (fr) 2023-11-02

Similar Documents

Publication Publication Date Title
US11860874B2 (en) Multi-partitioning data for combination operations
US11263211B2 (en) Data partitioning and ordering
US20200257691A1 (en) Executing untrusted commands from a distributed execution model
US9852144B2 (en) System and method for investigating large amounts of data
US11526504B1 (en) Search-time field extraction in a data intake and query system
US11520739B1 (en) Distributed query execution and aggregation
AU2015225694B2 (en) Managing data profiling operations related to data type
CN106687955B (zh) 简化将数据从数据源转移到数据目标的导入过程的调用
US11586585B2 (en) Method and system for historical call lookup in distributed file systems
US11687546B2 (en) Executing conditions with negation operators in analytical databases
US9218394B2 (en) Reading rows from memory prior to reading rows from secondary storage
US11921720B1 (en) Systems and methods for decoupling search processing language and machine learning analytics from storage of accessed data
US11450419B1 (en) Medication security and healthcare privacy systems
WO2020102638A1 (fr) Approche pour un compromis contrôlable entre le coût et la disponibilité de données indexées dans une solution d'agrégation de journal en nuage telle que splunk ou sumo
US11416458B2 (en) Efficient indexing for querying arrays in databases
US11500889B1 (en) Dynamic script generation for distributed query execution and aggregation
US11954068B2 (en) Data indexing for distributed query execution and aggregation
US11561993B2 (en) Generating real-time aggregates at scale for inclusion in one or more modified fields in a produced subset of data
WO2023211813A1 (fr) Génération de script dynamique pour exécution et agrégation d'interrogation distribuées
US11461302B1 (en) Storing multiple instances of data items to implement key overloading in database tables
CN117874082A (zh) 一种关联字典数据检索的方法及相关组件

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23797078

Country of ref document: EP

Kind code of ref document: A1