WO2023096587A2

WO2023096587A2 - Device and method for preparing data for responding to database queries

Info

Publication number: WO2023096587A2
Application number: PCT/SG2022/050869
Authority: WO
Inventors: Liufeng WANG; Libin ZHOU; Wei Hu; Lei Feng
Original assignee: Shopee IP Singapore Private Limited
Priority date: 2021-11-29
Filing date: 2022-11-29
Publication date: 2023-06-01
Also published as: WO2023096587A3

Abstract

Aspects concern method for preparing data for responding to database queries comprising receiving a message having a payload, wherein the payload includes a plurality of data elements, wherein each data element of the plurality of data elements is associated with a combination of array indices, generating, for each data element, a key-value pair by concatenating the array indices of the combination of array indices with which the data element is associated to a key and assigning the data element as value to the key, forming a data file comprising the key-value pairs; and storing the data file.

Description

DEVICE AND METHOD FOR PREPARING DATA FOR RESPONDING TO DATABASE QUERIES

TECHNICAL FIELD

[0001] Various aspects of this disclosure relate to devices and methods for preparing data for responding to database queries.

BACKGROUND

[0002] Providing data on-demand efficiently and promptly has become an important requirement for many data processing applications, in particular in machine learning and data- based calculation systems and typically directly relates to the accuracy of (e.g. calculation) results that can be achieved.

[0003] However, in a data distribution environment data may come from different data sources which may use very different data formats, in particular custom and in-house data formats. Handling custom and in-house formats is a tough challenge. It often requires hard- coded implementation case by case and brings high development cost.

[0004] Accordingly, efficient approaches are desirable which allow preparing data (received from different sources) for responding to queries to the data (in other words, for responding to database queries).

SUMMARY

[0005] Various embodiments concern a method for preparing data for responding to database queries including receiving a message having a payload, wherein the payload includes a plurality of data elements, wherein each data element of the plurality of data elements is associated with a combination of array indices, generating, for each data element, a key-value pair by concatenating the array indices of the combination of array indices with which the data element is associated to a key and assigning the data element as value to the key, forming a data file including the key-value pairs; and storing the data file.

[0006] According to one embodiment, the combination of array indices includes an array index for each of a plurality of hierarchy levels. [0007] According to one embodiment, the method includes generating a recursive map structure including a recursive map element for each data element by referring, for the data element, starting from the array index of the highest hierarchy level of the plurality of hierarchy levels and until the array index of the second-lowest hierarchy level, from each hierarchy level to the array index of one hierarchy level lower and referring, from the array index of the lowest hierarchy level to the data element and generating, from each data element, the key-value pair from the recursive map element for the data element.

[0008] According to one embodiment, the method includes assigning the payload to an object and the data elements to fields of the object and including generating the recursive map structure from the object.

[0009] According to one embodiment, the method includes expressing each array index as a string.

[0010] According to one embodiment, concatenating the array indices of the combination of array indices includes forming a JSON path of the array indices.

[0011] According to one embodiment, forming the JSON path includes forming the JSON path from the array indices in the order of hierarchy levels of the array indices.

[0012] According to one embodiment, the method includes concatenating the key-value pairs to a JSON string.

[0013] According to one embodiment, the method includes receiving a request for a data element, identifying the data element by a key of a key-value pair and providing the value of the key-value pair in response to the request.

[0014] According to one embodiment, the method includes receiving a plurality of messages, each message having a payload, wherein the payload includes a plurality of data elements, wherein each data element of the plurality of data elements is associated with a combination of array indices and including, for each received message, generating, for each data element, a key-value pair by concatenating the array indices of the combination of array indices with which the data element is associated to a key and assigning the data element as value to the key and forming a data file for the message including the key-value pairs. (These may be multiple files or one file for all for all messages, i.e. key-value pairs generated from data elements of different messages may or may not be stored together in the same file).

[0015] According to one embodiment, the method includes receiving the messages from multiple data sources. [0016] According to one embodiment, messages received from different data sources have different message formats.

[0017] According to one embodiment, a data server computer system is provided including a communication interface, a memory and a processing unit configured to perform the method for preparing data for responding to database queries.

[0018] According to one embodiment, a computer program element is provided including program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method for preparing data for responding to database queries.

[0019] According to one embodiment, a computer-readable medium is provided including program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method for preparing data for responding to database queries. The computer-readable medium may include a non-transitory computer-readable medium.

[0020] It should be noted that embodiments described in context of the method for preparing data for responding to database queries are analogously valid for the data server computer.

BRIEF DESCRIPTION OF THE DRAWINGS

[0021] The invention will be better understood with reference to the detailed description when considered in conjunction with the non-limiting examples and the accompanying drawings, in which:

- FIG. 1 shows a data processing system.

- FIG. 2 shows a data processing pipeline, for example implemented by the server 103.

- FIG. 3 illustrates an access of data on JSON level.

- FIG. 4 shows a flow diagram illustrating a processing of the data processing the pipeline according to an embodiment.

- FIG. 5 illustrates a raw data message stream according to a first system (“system A”).

- FIG. 6 illustrates a raw data message stream according to a second system (“system B”).

- FIG. 7 illustrates a system A data object representation.

- FIG. 8 illustrates a system B data object representation.

- FIG. 9 illustrates a system A recursive map data representation.

- FIG. 10 illustrates a system B recursive map data representation. - FIG. 11 illustrates a system A JSON data representation.

- FIG. 12 illustrates a system B JSON data representation.

- FIG. 13 illustrates system A string field values.

- FIG. 14 illustrates system B string field values.

- FIG. 15 illustrates reading out system A values.

- FIG. 16 illustrates reading out system B values.

- FIG. 17 shows a flow diagram illustrating a method for preparing data for responding to database queries.

- FIG. 18 shows a data server computer system according to an embodiment.

DETAILED DESCRIPTION

[0022] The following detailed description refers to the accompanying drawings that show, by way of illustration, specific details and embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure. Other embodiments may be utilized and structural, and logical changes may be made without departing from the scope of the disclosure. The various embodiments are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.

[0023] Embodiments described in the context of one of the devices or methods are analogously valid for the other devices or methods. Similarly, embodiments described in the context of a device are analogously valid for a method, and vice-versa.

[0024] Features that are described in the context of an embodiment may correspondingly be applicable to the same or similar features in the other embodiments. Features that are described in the context of an embodiment may correspondingly be applicable to the other embodiments, even if not explicitly described in these other embodiments. Furthermore, additions and/or combinations and/or alternatives as described for a feature in the context of an embodiment may correspondingly be applicable to the same or similar feature in the other embodiments.

[0025] In the context of various embodiments, the articles “a”, “an” and “the” as used with regard to a feature or element include a reference to one or more of the features or elements.

[0026] As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

[0027] In the following, embodiments will be described in detail. [0028] FIG. 1 shows a data processing system 100.

[0029] The data processing system 100 includes a plurality of data sources 101. These may be servers hosting any kind of data base, computers coupled with sensors that perform measurements, data input terminals, etc. The data sources 101 are coupled via a communication network 102 (or also an arrangement of communication networks), e.g. including or consisting of the Internet but possibly also including non-public internal communication networks, to a data sever 103. The data server 103 is provided to provide responses 105 to queries 104 sent by client 106.

[0030] It should be noted that the data server 103 and one or more clients 106 may be implemented by the same computer but may also be implemented by different computers (in particular, the server 103 may be implemented by a cluster of computers, i.e. may be implemented by a server computer system, possibly including multiple computers). In the latter case, the computers (implementing the clients 106 and the server 103) may for example be connected by the communication network 102 or one or more other communication networks. [0031] The data processing system 100 may be seen to implement a data distribution environment (i.e. data is distributed via the data server 103 from the data sources 101 to the clients 106).

[0032] A query 104 of a client 106 may relate to data provided by any one of the data sources 101. However, the data sources 101 may provide data in different (e.g. custom and inhouse) formats which gives rise to the following issues, in particular in industry:

• Regarding data processing, various formats and nested structured data cost lots of resources to parse or unpack the data (e.g. data frames) before it is possible to access a certain field (e.g. as specified by a query 104).

• Regarding the processing architecture, un-flattening relational data sets or streams can easily cause load skew and high availability issues before access specified fields, even just read, in distributed clusters (e.g. in case the server 103 is implemented by a cluster of server computers).

[0033] In view of the above, according to various embodiments, a data processing system is implemented in which structured packages to are flattened to (non-relational) key-value pairs (e.g. by the data server 103 for the example of the data processing system 100). This makes the data friendlier to a distributed environment like the data processing system 100 and allows high timely calculation with low development cost. [0034] In contrast to providing a data transform interface and the implementing case by case hard code for different kinds of messages from data sources 101, various embodiments implement a reusable framework which can be easily adapted for different data types of messages (like bytes stream Protobuf or char stream JSON) from different data sources 101.

[0035] In other words, according to various embodiments, a reusable data processing pipeline is provided which flattens structured data into non-relational KV pair form, which is friendlier to distributed environments.

[0036] FIG. 2 shows a data processing pipeline 200, for example implemented by the server 103.

[0037] In this example, the server 103 receives data from two data sources 101: a first data set 201 (provided, e.g. structured, in accordance with a first system, e.g. data format (encoding, wrapping, compression, etc.), “system A”) and a second data set 202 (provided, e.g. structured, in accordance with a first system, e.g. data format (encoding, wrapping, compression, etc.), “system B”). The data sets 201, 202 are for example raw (message) data streams.

[0038] The server 103 processes the data sets by

• Deserializing the raw data streams and transforming each deserialized raw stream to a respective object 209, 210 (i.e. transforming from message level 203 to object level 204)

• Transforming each object 209, 210 to a respective recursive map data structure 211 (i.e. transforming from object level 204 to map level 205).

• Transforming each recursive map data structure 211 to a respective JSON file (i.e. transforming from map level 205 to JSON level 206).

[0039] For responding to a query 104, the server 103 may then extract values of key-value pairs by JSONPath (on JSONPath level 207).

[0040] The server may further process the extracted values (on reader logic level 208), e.g. perform type conversion, row filtering, field filtering and return default values for null/error fields in response to a query.

[0041] The approach of FIG. 2 allows the server 103 to transform and normalize various end data formats (used by the various data sources 101 to deliver data) into uniform recursive key-value maps, which can be randomly accessed and friendly to distributed environments.

[0042] FIG. 3 illustrates an access of data on JSON level 206. For each value 301, the key is a JSON path 302. Values may be read out using for-loops to loop over the keys. This means that if a JSON path is given by $.a.b.c, multiple values may be read out by looping over the indices a, b and c wherein there may be an outer loop to loop over a (i.e. over possible values, e.g. strings, of a), an intermediate loop to loop over b and an inner loop to loop over c, wherein in each iteration of the inner loop, a GET FIELD command 303 is executed to get the value for the path specified by the current combination of a, b and c.

[0043] FIG. 4 shows a flow diagram 400 illustrating a processing by the data processing the pipeline 200.

[0044] As in the illustration of FIG. 2, examples are given for two data formats, a “system A” and a “system B”.

[0045] FIG. 5 illustrates a system A raw data message stream.

[0046] FIG. 6. illustrates a system B raw data message stream.

[0047] In 401, the messages are deserialized into objects.

[0048] FIG. 7 illustrates a system A data object representation, e.g. as Java objects (or objects of another object-oriented programming language). The object representation may be a representation in any data structure stored in computer memory space that presents information like C or BASIC (objects encapsulate relative methods more than a data structure). Thus, the object representation is a representation in a host programming language understandable "object" or "special data structure" (as in FIG. 7) in contrast to a network understand-able message stream (as in FIG. 5 and FIG. 6).

[0049] FIG. 8 illustrates a system B data object representation.

[0050] In 402, the server converts object fields into recursive map elements. It should be noted that because atomic fields can be wrapped (packaged) deeply in a message like an onion, usually, when data of inner fields or inner sub-fields needs to be obtained, different depth inner layers need to be decoded or unwrapped according to the respective packaging stack (e.g. format), which causes the outer layers to unwrap many times (or need to store many intermediate states). According to various embodiments, the innermost layer is unwrapped and all (or all on-demand) the innermost atomic fields are unwrapped in one stage with DFS (deep first search) and afterwards flattened to a KV layout. This can be seen to concentrate the unwrapping work into one stage, i.e. the conversion job only needs to be done one time.

[0051] FIG. 9 illustrates a system A recursive map data representation.

[0052] FIG. 10 illustrates a system B recursive map data representation. [0053] In 403, the server uses JSON API (Application Programmers Interface) and, for example, an existing JSON library to transform the recursive map data structure into standard JSON format.

[0054] FIG. 11 illustrates a system A JSON data representation.

[0055] FIG. 12 illustrates a system B JSON data representation.

[0056] In 404, the server forms key-value pairs, wherein each key is a standard JSON path(“$.a.b.c”), as e.g. used in front end and data analyses, and the associated value is the content of the field to which the JSON path points.

[0057] FIG. 13 illustrates system A string field values.

[0058] FIG. 14 illustrates system B string field values.

[0059] In 405, the server 103 uses the key-value pairs to randomly access fields (including inner parts).

[0060] FIG. 15 illustrates reading out system A values.

[0061] FIG. 16 illustrates reading out system B values.

[0062] In 406, other logic may be applied on the results of the data access by the server 103 or by the clients 106.

[0063] Examples for data formats used by the data sources 101 are for example HBase rows, hive rows and MySQL binlogs rows. These formats may even overlap and formed into a nested format data frame. For example, a field of a data row (according to a first data format) may be a row in a second data format.

[0064] While parsing such data case by case and leads to inflexibility, development resource costs, and redundant similar procedures, the approach described above with reference to of FIGs. 2 to 4 allows, because of the provided key-value data format, to look up data elements (i.e. atoms or leaf fields) in one operation, instead of recursively unpacking and traversing fields many times. Further, shortcuts may be included into the loops like determining how to access the next value when an abnormal situation occurs.

[0065] In a nested data structure, the structure layer count is O(log n) in average and can be 0(1) in best cases if all data fields are flattened (in 1 layer), i.e. computational effort to access data can be reduced by the approach described above from N * O(log n) to N * 0(1) where N = data accessing times and n = number of data fields.

[0066] In summary, according to various embodiments, a method is provided as illustrated in FIG. 17. [0067] FIG. 17 shows a flow diagram 1700 illustrating a method for preparing data for responding to database queries.

[0068] In 1701, a message having a payload is received, wherein the payload includes a plurality of data elements, wherein each data element of the plurality of data elements is associated with a combination of array indices.

[0069] In 1702, a key-value pair is generated for each data element by concatenating the array indices of the combination of array indices with which the data element is associated to a key and assigning the data element as value to the key.

[0070] In 1703, a data file including the key-value pairs is formed.

[0071] In 1704, the data file is stored (for responding to data base queries).

[0072] According to various embodiments, in other words, data which is contained in messages in the form of a hierarchical data structure, e.g. a multi-dimensional array or in the form of nested lists (i.e. where a list element may be a list itself) is flattened to key-value pairs, wherein each key reflects the combination of array indices which refer to the respective data elements (e.g. a concatenation of numeric array indices, or a list names, etc.). This allows, irrespective of the message format and irrespective of the data structure according to which the message (originally) contained the data elements, to quickly respond to queries for the data elements, by identifying the respective key (e.g. by following a JSON path forming the respective key).

[0073] The approach of FIG. 17 may for example be used in big data processing platforms and cloud computing environments to reduce costs and increase efficiency

[0074] The method of FIG. 17 is for example carried out by a data server computer system as illustrated in FIG. 18.

[0075] FIG. 18 shows a data server computer system 1800 according to an embodiment. [0076] The data server computer system 1800 includes a communication interface 1801 (e.g. configured to receive messages and queries and provide responses to queries). The data server computer system 1800 further includes a processing unit 1802 and a memory 1803. The memory 1803 may be used by the processing unit 1802 to store, for example, formed data files. The data server computer system 1800 is configured to perform the method of FIG. 17.

[0077] The methods described herein may be performed and the various processing or computation units and the devices and computing entities described herein may be implemented by one or more circuits. In an embodiment, a "circuit" may be understood as any kind of a logic implementing entity, which may be hardware, software, firmware, or any combination thereof. Thus, in an embodiment, a "circuit" may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor. A "circuit" may also be software being implemented or executed by a processor, e.g. any kind of computer program, e.g. a computer program using a virtual machine code. Any other kind of implementation of the respective functions which are described herein may also be understood as a "circuit" in accordance with an alternative embodiment.

[0078] While the disclosure has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

CLAIMS A method for preparing data for responding to database queries comprising: receiving a message having a payload, wherein the payload includes a plurality of data elements, wherein each data element of the plurality of data elements is associated with a combination of array indices, the combination of array indices comprises an array index for each of a plurality of hierarchy levels; performing a deep first search associated with an innermost level of the plurality of hierarchy levels; generating, for each data element, a key-value pair by concatenating the array indices of the combination of array indices with which the data element is associated to a key and assigning the data element as value to the key; forming a data file comprising the key-value pairs; and storing the data file; wherein generating the key-value pair includes generating a recursive map structure comprising a recursive map element for each data element by referring, for the data element, starting from the array index of the highest hierarchy level of the plurality of hierarchy levels and until the array index of the second-lowest hierarchy level, from each hierarchy level to the array index of one hierarchy level lower and referring, from the array index of the lowest hierarchy level to the data element and generating, from each data element, the key-value pair from the recursive map element for the data element. The method of claim 1, further comprising the step of performing at least one of a type conversion, a row filtering, a field filtering and a default values returning for null/error fields in response to the database queries. The method of claim 1, comprising assigning the payload to an object and the data elements to fields of the object and comprising generating the recursive map structure from the object. The method of any one of claims 1 to 3, comprising expressing each array index as a string. The method of any one of claims 1 to 4, wherein concatenating the array indices of the combination of array indices comprises forming a JSON path of the array indices. The method of claim 5, wherein forming the JSON path comprises forming the JSON path from the array indices in the order of hierarchy levels of the array indices. The method of any one of claims 1 to 6, comprising concatenating the key-value pairs to a JSON string. The method of any one of claims 1 to 7, comprising receiving a request for a data element, identifying the data element by a key of a key-value pair and providing the value of the key-value pair in response to the request. The method of any one of claims 1 to 8, comprising receiving a plurality of messages, each message having a payload, wherein the payload includes a plurality of data elements, wherein each data element of the plurality of data elements is associated with a combination of array indices and comprising, for each received message, generating, for each data element, a key-value pair by concatenating the array indices of the combination of array indices with which the data element is associated to a key and assigning the data element as value to the key; and forming a data file for the message comprising the key-value pairs. The method of any one of claims 1 to 9, comprising receiving the messages from multiple data sources. The method of any one of claims 1 to 10, wherein messages received from different data sources have different message formats. A data server computer system comprising a communication interface, a memory and a processing unit configured to perform the method of any one of claims 1 to 11. A computer program element comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 11. A computer-readable medium comprising program instructions, which, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 11.