WO2019120629A1 - On-demand snapshots from distributed data storage systems

Info

Publication number: WO2019120629A1
Application number: PCT/EP2018/052774
Authority: WO
Inventors: David Manzano Macho
Original assignee: Telefonaktiebolaget Lm Ericsson (Publ)
Priority date: 2017-12-21
Filing date: 2018-02-05
Publication date: 2019-06-27

Abstract

A data entry analysis module and associated method for analysing replication flows from a master node to a slave node of a distributed database, and comprising: a receiver for receiving replication flows with operations that update data entries in the distributed database; a pattern analyser for determining whether a predictive pattern exists for each operation, the predictive pattern indicating an expected value for parameters of each data entry; the pattern analyser controlling a transmitter to transmit, toward a snapshot handler, information relating to the operations and comprising: if a predictive pattern exists for an operation, information relating to the predictive pattern; and if a predictive pattern does not exist for an operation, the received replication flow with the operation. There is also provided a snapshot handler and associated method for generating a snapshot of a distributed database, the snapshot comprising a plurality of data entries, the snapshot handler comprising: a receiver for receiving, from a data entry analysis module, information relating to operations that update data entries in the distributed database, the received information comprising: if a predictive pattern exists for an operation, information relating to the predictive pattern, if a predictive pattern does not exist for an operation, the replication flow with the operation, which is unrelated to any predictive pattern; and a snapshot generator configured to: identify unchanged information for the operations; and generate a snapshot based on the predictive patterns for corresponding operations, the information unrelated to any predictive pattern for the corresponding operations and the unchanged information.

Description

ON-DEMAND SNAPSHOTS FROM DISTRIBUTED DATA STORAGE SYSTEMS

Technical field

The invention relates to distributed database systems, and more specifically, to the replication and management of data related to the distributed database system and/or the generation of snapshots of the distributed database.

Background

The value and use of information is continuously increasing, and businesses look for additional ways to process, store, analyse, and gain actionable insights from their data. Due to variations in information requirements between different applications, requirements for data management systems may also vary. Providing efficient ways to access data stored in a distributed database system is important for the development of analytics-based solutions using such data.

When handling large data volumes, the data can be distributed into independent, smaller sets, sometimes called shards, wherein each shard can be placed into different nodes. The nodes can be separated onto different hardware units, for example different servers, and each node may also be geographically separated from one or more of the other nodes. This separation of large data volumes into shards can improve performance and ensure scalability, by horizontally partitioning database contents into smaller and independent components. Each node takes care of the part of the data it contains. A node in such a distributed system can also be used to store one or more replicas of other shards located in other nodes of the distributed system. In order to maintain these replica(s) of a shard, a node containing the master shard, the master node, must replicate the content of the master shard towards the other node or nodes, sometimes referred to as slave nodes, storing the one or more replica shards. Replication of data is important for providing high availability of the data, and to secure the data in case of failures in parts of the distributed data system.

A snapshot is a static view of a database at a given time, and contains all the data entries committed into the whole or part of the database of which the snapshot provides a view, at the time of creation of the snapshot. At a given time a snapshot can be obtained by replicating, either continuously or periodically, the content of each data entry in the database from each node in a cluster of nodes of a distributed database system into another node managing the creation of the snapshot. If a request for a snapshot is sent to a cluster of nodes in a distributed database system, the data used to create the snapshot is usually taken from a slave node, which is a node containing a replica of data of a master node in the distributed data system, to avoid impact on the performance of the master node. In this approach of snapshot creation, an initial snapshot is taken from the database content, and the content of the snapshot is further updated as data entries in the nodes change. After the completion of the creation of the initial snapshot, updated data consumes additional storage space on top of the space needed for the initial snapshot.

The creation of a snapshot from a distributed database system as described above implies that all the updates from all the nodes in the system need to be collected to create a consolidated view of the database contents at a given time, to gather all changes since the last snapshot was created. The time span set for gathering updates made to the master nodes may change depending on how critical it is to keep the data safe, and how fast a snapshot is released following a snapshot request. In some cases, it is possible that an update to a master node is immediately propagated to a node managing the snapshot creation. When a snapshot is refreshed, all updates propagated to a snapshot handler are examined to see if they affect data entries stored in the snapshot, and if the snapshot should be updated to include a change to the content of the data entries. The larger the number of updates to data in the master nodes of the distributed database since the snapshot was last refreshed, the more time a snapshot refresh will take. This happens even if the refresh does not require any changes to be made to the snapshot (e.g. all updates of the master data leave the snapshot with the same data content, so an update to the data did not involve a change to the data content already stored by the snapshot node).
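By way of illustration only, the refresh cost described above can be sketched as follows. The function name `refresh` and the representation of updates as (key, value) pairs are assumptions made for this example and are not taken from any particular prior-art system.

```python
from typing import Any, Dict, Iterable, Tuple


def refresh(snapshot: Dict[str, Any],
            updates: Iterable[Tuple[str, Any]]) -> Tuple[int, int]:
    """Apply propagated updates to a snapshot held as a dict.

    Returns (examined, applied): every update has to be examined,
    even when it leaves the snapshot content unchanged.
    """
    examined = 0
    applied = 0
    for key, value in updates:
        examined += 1
        if key not in snapshot or snapshot[key] != value:
            snapshot[key] = value
            applied += 1
    return examined, applied


# 1000 updates that all rewrite the value already held by the snapshot:
# all of them are examined although none changes the snapshot.
snap = {"a": 1}
counts = refresh(snap, [("a", 1)] * 1000)
```

In this sketch the work done grows with the number of updates received, not with the number of data entries that actually change.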

Summary

Creating a snapshot from a distributed database by collecting all the updates made to all the master nodes containing data to be contained in the snapshot requires all these changes to be forwarded to the node managing the creation of the snapshot. High-volume replications from nodes in a distributed database cluster towards a snapshot node require a scalable solution to avoid overloading the replication channel and degrading the whole system performance. Performing this operation faster improves the accuracy of the data content of the snapshot. It may also happen that several snapshots are requested at consecutive points in time, and maintaining the replications needed to create this series of snapshots can aggravate problems with transporting large amounts of data from the distributed database, which can impact performance of the distributed database, and handling the data at the snapshot node, which can affect snapshot generation time. The approach known in the prior art has several drawbacks, including:

• In distributed database systems with a significant number of master nodes, the amount of data to be transferred to make a new snapshot may impact the performance of the replication procedure. It may consume a lot of network resources and in overloading situations, the snapshot generation may be delayed or even interrupted due to the time it takes to complete the replication of updates to data content across the snapshot. It is particularly relevant in cases of bulk updates, when large update operations are issued to the distributed database system, and these need to be replicated to the snapshot node.

• One or more nodes of the distributed database system from which a snapshot is created can be physically located far from the node managing snapshot creation. Network conditions, load on the nodes and shards through updates and/or other concurrent use of the distributed database system may cause delays to the creation of the snapshot.

• The amount of storage space required on a storage medium for storing the updates, the previous snapshot, and storage required to generate a new snapshot can be large in case of large databases. The number of input/output operations or write operations to a storage medium depends on many factors and is difficult, e.g. for a system administrator, to predict. As a result, it is difficult to determine the amount of space required to store snapshots and other content needed to generate them. At some point, an old snapshot should be replaced with a new one. The disk space required by a snapshot depends on how many different modifications have been made in the source database during the life of the snapshot. When it happens that all the database content is updated at least once, the snapshot update records will grow to the size of the database of which the snapshot is made.

• The amount of time it takes for a snapshot to be created may result in a long delay for the application requesting the snapshot. In the case the snapshot creation has scheduled periodic refreshes for snapshot generation, this periodic update might introduce a long delay until the existing snapshot is fully updated.

• In many cases, only a small portion of the data contained in a snapshot will be used by the application requesting it, but dealing with massive amounts of data implies transferring lots of information between the master nodes or their slaves and the snapshot node.

The invention as disclosed herein relates to generating an on-demand snapshot from a distributed database system when the distributed database system’s performance needs to be guaranteed, by reducing the resources devoted to managing and releasing a snapshot. Further, and more generally, one or more embodiments of the invention seek to mitigate or solve one or more of the problems with the prior art, including those mentioned herein.

According to the invention in an aspect, there is provided a data entry analysis module for analysing replication flows from a master node to a slave node of a distributed database. The data entry analysis module comprises a receiving means, which may be a receiver, configured to receive replication flows comprising one or more operations that update one or more data entries in the distributed database. The data entry analysis module comprises a pattern analysing means, which may be a pattern analyser configured to determine whether a predictive pattern exists for each of the one or more operations, wherein the predictive pattern indicates an expected value for one or more parameters of each one or more data entries. The pattern analyser is further configured to control a transmitting means, which may be a transmitter to transmit, toward a snapshot handler, information relating to the operations that comprises: if a predictive pattern exists for an operation, information relating to the predictive pattern; if a predictive pattern does not exist for an operation, the received replication flow comprising the operation, which is unrelated to any predictive pattern.

Optionally, the receiver is further configured to receive the predictive patterns from any one of a pattern database and the snapshot handler.

Optionally, the information relating to the predictive pattern comprises a confidence level for the predictive pattern.

Optionally, the confidence level for the predictive pattern is increased when it is determined that the predictive pattern exists for an operation.

Optionally, the one or more parameters comprise any one of a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof.

Optionally, the information relating to the predictive pattern comprises an identifier of the predictive pattern.
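By way of illustration only, the behaviour of the data entry analysis module set out above can be sketched as follows. The class names, the field names, the use of an hour of day as the single parameter, and the use of a match count as the confidence level are all assumptions made for this example; the disclosure does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Operation:
    """One update carried in a replication flow (hypothetical shape)."""
    entry_key: str
    new_value: Any
    hour: int  # the single parameter used in this sketch: hour of day


@dataclass
class PredictivePattern:
    """Expected value of a data entry for a given hour (hypothetical shape)."""
    pattern_id: str
    entry_key: str
    expected: Dict[int, Any]  # hour -> expected value
    confidence: int = 0       # assumed here to be a count of matching operations

    def matches(self, op: Operation) -> bool:
        return (op.entry_key == self.entry_key
                and op.hour in self.expected
                and self.expected[op.hour] == op.new_value)


def analyse(flow: List[Operation],
            patterns: List[PredictivePattern]) -> List[Tuple[str, Any]]:
    """Build the information to transmit toward the snapshot handler."""
    messages: List[Tuple[str, Any]] = []
    for op in flow:
        match = next((p for p in patterns if p.matches(op)), None)
        if match is not None:
            # A predictive pattern exists: raise its confidence level and
            # send only the pattern identifier and confidence level.
            match.confidence += 1
            messages.append(("pattern", (match.pattern_id, match.confidence)))
        else:
            # No predictive pattern exists: forward the operation itself.
            messages.append(("operation", op))
    return messages


pattern = PredictivePattern("p1", "user42/status", {8: "online", 23: "offline"})
flow = [Operation("user42/status", "online", 8),
        Operation("user42/status", "busy", 12)]
messages = analyse(flow, [pattern])
```

In this sketch the first operation is replaced by a short reference to pattern "p1", whereas the second operation, which no pattern describes, is forwarded unchanged.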

According to the invention in a further aspect, there is provided a snapshot handler for generating a snapshot of a distributed database, the snapshot comprising a plurality of data entries representative of corresponding data entries in the distributed database at a time that the snapshot is taken. The snapshot handler comprises a receiving means, which may be a receiver, configured to receive, from a data entry analysis module, information relating to operations that update one or more data entries in the distributed database. The received information comprises: if a predictive pattern exists for an operation, information relating to the predictive pattern, wherein the predictive pattern indicates an expected value for one or more parameters of each one or more data entries; if a predictive pattern does not exist for an operation, a replication flow, from a master node to a slave node of the distributed database, the replication flow comprising the operation that updates the one or more data entries in the distributed database and is unrelated to any predictive pattern. The snapshot handler comprises a snapshot generating means, which may be a snapshot generator, configured to: identify unchanged information for the operations; and generate a snapshot based on the predictive patterns for corresponding operations, the information unrelated to any predictive pattern for the corresponding operations and the unchanged information.

Optionally, the snapshot handler further comprises a pattern learning means, which may be a pattern learning module, configured to determine a new predictive pattern for the replication flow comprising the one or more operations that update the one or more data entries, based on a historical record of the one or more data entries and the one or more parameters.

Optionally, the pattern learning module is configured to control a transmitting means, which may be a transmitter, to transmit the new predictive pattern toward any one of a pattern database and the data entry analysis module.

Optionally, the snapshot generator is further configured to store the information relating to the predictive patterns for the corresponding operations as predictable information, store the information unrelated to any predictive pattern for the corresponding operations as unpredictable information, and store the unchanged information for the operations.

Optionally, the receiver is further configured to receive a request for a snapshot from a client node and, responsive to this request, the snapshot generator is configured to generate the snapshot and control the transmitter to transmit the snapshot toward the client node.

Optionally, the receiver is further configured to receive the predictive patterns from any one of a pattern database and the data entry analysis module.

Optionally, the information relating to the predictive pattern comprises a confidence level for the predictive pattern.

Optionally, the one or more parameters comprise any one of a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof.

According to the invention in a further aspect, there is provided a method for generating a snapshot of a distributed database, the snapshot comprising a plurality of data entries representative of corresponding data entries in the distributed database, the method carried out at a data entry analysis module, which is in charge of analysing replication flows from a master node to a slave node of the distributed database. The method comprises receiving replication flows that comprise operations to update one or more data entries in the distributed database. The method comprises determining whether a predictive pattern exists for each operation, wherein the predictive pattern indicates an expected value for one or more parameters of each one or more data entries. The method comprises transmitting, toward a snapshot handler, information relating to the operations that comprises: if a predictive pattern exists for an operation, information relating to the predictive pattern; if a predictive pattern does not exist for an operation, the received replication flow that comprises the operation, which is unrelated to any predictive pattern.

Optionally, the method further comprises receiving the predictive patterns from any one of a pattern database and the snapshot handler.

According to the invention in a further aspect, there is provided a method for generating a snapshot of a distributed database, the snapshot comprising a plurality of data entries representative of corresponding data entries in the distributed database, the method carried out at a snapshot handler, which is in charge of generating the snapshot. The method comprises receiving, from a data entry analysis module, information relating to operations that update one or more data entries in the distributed database. The received information comprises: if a predictive pattern exists for an operation, information relating to the predictive pattern, wherein the predictive pattern indicates an expected value for one or more parameters of each one or more data entries; if a predictive pattern does not exist for an operation, a replication flow, from a master node to a slave node of the distributed database, wherein the replication flow comprises the operation that updates the one or more data entries and is unrelated to any predictive pattern. The method comprises identifying unchanged information for the operations and generating a snapshot based on predictive patterns for the corresponding operations, the information unrelated to any predictive pattern for the corresponding operations and the unchanged information.

Optionally, the method further comprises determining a new predictive pattern for the replication flow, wherein the replication flow comprises the operation that updates the one or more data entries, based on a historical record of the one or more data entries and the one or more parameters.

Optionally, the method further comprises transmitting the new predictive pattern toward any one of a pattern database and the data entry analysis module.

Optionally, the method further comprises storing the information relating to the predictive patterns for the corresponding operations as predictable information, storing the information unrelated to any predictive pattern for the corresponding operations as unpredictable information, and storing the unchanged information for the operations.

Optionally, the method further comprises receiving a request for a snapshot from a client node and, responsive to this request, generating the snapshot and transmitting the snapshot toward the client node.

Optionally, the predictive patterns are received from any one of a pattern database and the data entry analysis module.

Optionally, the information relating to the predictive pattern comprises an identifier of the predictive pattern.

Optionally, the one or more parameters comprise any one of a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof.

According to the invention in a further aspect, there is provided a computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out any method disclosed.

According to the invention in a further aspect, there is provided a carrier containing the computer program mentioned above, wherein the carrier is one of an electronic signal, optical signal, radio signal, or non-transitory computer readable storage medium.

Brief description of the drawings

Figure 1 is a block schematic representation of a database system;

Figure 2 is a block schematic representation of a snapshot handler;

Figure 3 is a block schematic representation of a data entry analysis module;

Figure 4 is a signalling diagram showing determination of a predictive pattern;

Figure 5 is a signalling diagram showing generation of a snapshot; and

Figure 6 is a block schematic representation of a distributed database system.

Detailed description

Methods and apparatus disclosed herein provide a distributed database system operator the opportunity to generate snapshots of at least a part of the distributed database by predicting at least one of the data entries in the database based on a predictive pattern and combining the predicted data entries with data that is unchanged and data that cannot be predicted, which may be received from replication flows. The predictive pattern indicates what the value of one or more data entries should be for one or more parameters, such as a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof. The predictive pattern may be determined based on historical analysis of the value of the one or more data entries and the one or more parameters. It is noted that the term “value” when used in relation to database data entries is not limited to a numerical value and may be text based or take any other format depending on the data to be stored in the data entry.

Referring to Figure 1, in exemplary arrangements, an operational database infrastructure 100, which may be a distributed database, comprises a master database 102 and a slave database 104. The slave 104 corresponds to the master 102 in that it contains a copy of one or more shards (and potentially all the shards) of the master 102. The operational database infrastructure 100 further comprises a replication log 106 for detecting changes in the master 102. The master 102 is in electrical communication with the slave 104 such that information in the replication log 106 may be transferred to the slave 104.

In use, a client 108 sends an instruction and/or data to the master 102, one or more data entries of which are thereby updated. The replication log 106 detects the change in the master 102 and replication data relating to the updates of data entries in a master node 102 is transmitted to the corresponding slave node 104 in a replication flow.

The operational database infrastructure further comprises a data entry analysis module 300 that is configured to intercept data indicative of an update to the master 102, in this case replication data. The replication data may comprise the actual operations undertaken on the master 102. The data entry analysis module 300 is in electrical communication with a snapshot handler 200, which is configured to generate at least a partial snapshot of the master 102.

The snapshot handler 200 comprises a pattern learning module 202. The pattern learning module is in electrical communication with the data entry analysis module 300 and a pattern database 400. As described in detail below, the pattern learning module 202 is configured to determine new predictive pattern or update existing predictive pattern. The pattern database 110 stores determined predictive pattern and is also in electrical communication with the data entry analysis module 300. The snapshot handler also comprises a snapshot generator 214 for generating a snapshot of at least a part of the master 102 of the operational database infrastructure 100, and which is in electrical communication with the pattern database 110. As can be seen in Figure 1, the snapshot may comprise data entries 216 predicted using the predictive pattern, data entries 218 that remain unchanged since a previous snapshot and data entries 220 that do not fit with a pattern and are therefore considered unpredictable.

Figure 2 shows a schematic representation of a snapshot handler 200. The snapshot handler 200 may be used as shown in Figure 1. The snapshot handler 200 comprises a transmitter 202 and a receiver 204. The transmitter 202 and receiver 204 may be in data communication with other entities in a telecommunications network and/or database infrastructure and are configured to transmit and receive data accordingly.

The snapshot handler 200 further comprises a memory 206 and a processor 208. The memory 206 may comprise a non-volatile memory and/or a volatile memory. The memory 206 may have a computer program 210 stored therein. The computer program 210 may be configured to undertake methods disclosed herein. The computer program 210 may be loaded in the memory 206 from a non-transitory computer readable medium 212, on which the computer program is stored. The processor 208 is configured to undertake one or more of the functions of a snapshot generator 214 and a pattern learning module 216, as set out below.

Each of the transmitter 202 and receiver 204, memory 206, processor 208, snapshot generator 214 and pattern learning module 216 is in data communication with the other features 202, 204, 206, 208, 210, 214, 216 of the snapshot handler 200. The snapshot handler 200 can be implemented as a combination of computer hardware and software. In particular, the snapshot generator 214 and the pattern learning module 216 may be implemented as software configured to run on the processor 208, or as combinations of hardware and software in separate modules. The memory 206 stores the various programs/executable files that are implemented by a processor 208, and also provides a storage unit for any required data. The programs/executable files stored in the memory 206, and implemented by the processor 208, can include the snapshot generator 214 and the pattern learning module 216, but are not limited to such.

Figure 3 shows a schematic representation of a data entry analysis module 300. The data entry analysis module 300 may be used as shown in Figure 1. The data entry analysis module 300 comprises a transmitter 302 and a receiver 304. The transmitter 302 and receiver 304 may be in data communication with other entities in a telecommunications network and/or database infrastructure and are configured to transmit and receive data accordingly.

The data entry analysis module 300 further comprises a memory 306 and a processor 308. The memory 306 may comprise a non-volatile memory and/or a volatile memory. The memory 306 may have a computer program 310 stored therein. The computer program 310 may be configured to undertake methods disclosed herein. The computer program 310 may be loaded in the memory 306 from a non-transitory computer readable medium 312, on which the computer program is stored. The processor 308 is configured to undertake the functions of a pattern analyzer 314, as set out below.

Each of the transmitter 302 and receiver 304, memory 306, processor 308 and pattern analyzer 314 is in data communication with the other features 302, 304, 306, 308, 310, 314 of the data entry analysis module 300. The data entry analysis module 300 can be implemented as a combination of computer hardware and software. In particular, the pattern analyzer 314 may be implemented as software configured to run on the processor 308, or as combinations of hardware and software in separate modules. The memory 306 stores the various programs/executable files that are implemented by a processor 308, and also provides a storage unit for any required data. The programs/executable files stored in the memory 306, and implemented by the processor 308, can include the pattern analyzer 314, but are not limited to such.

A database may comprise a distributed database system, or other type of database or data storage system. The database comprises a plurality of data entries, wherein each data entry has a value. This value may be any type of content, including numerical content, textual content, audio content, visual content (e.g. one or more images, video, etc.), or any other type of content, or a combination of multiple types of content.

Methods and apparatus disclosed herein propose detecting predictive patterns that describe how and when the data entries in the master database 102 change. The predictive patterns may be determined based on historical values of the data entries and one or more parameters associated with the data entries, such as time, location and any others mentioned herein. The predictive patterns may be saved in the pattern database and used to predict the data entries when generating a snapshot. As mentioned above, the distributed database system comprises a data entry analysis module 300 that is devoted to capturing (or receiving) data indicative of the updates or changes in data entries introduced by the client application 108 of the database system 100 and checking whether they follow any existing predictive pattern that describes how and when the data entry is expected to change. In addition, the snapshot handler 200 is responsible for handling the data needed to generate a new snapshot (or partial snapshot) whenever it is requested or scheduled. The snapshot handler 200 is also responsible for detecting and determining new predictive pattern or updating existing predictive pattern based on data transmitted by the aforementioned data entry analysis module 300, which is indicative of updates to the master 102, and historical records about how the data entry has changed in the past. The snapshot handler 200 also stores any new data for data entries that do not follow any pattern. Snapshots may be generated by combining data entries predicted using predictive pattern with data entries that do not follow any pattern. When new predictive pattern is generated, it is made available to the data entry analysis module 300 for use in determining whether valid predictive pattern exists for a given database update. This avoids transmission of update data towards the snapshot handler 200 if the predictive pattern describes the associated update.
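By way of illustration only, one very simple way of determining predictive patterns from a historical record is sketched below. The rule used (a data entry is treated as predictable for a given hour if it has taken the same value at that hour on at least `min_support` occasions and never a different one), the function name and the record layout are assumptions made for this example; the disclosure does not limit pattern learning to this rule.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

# One historical observation: (entry_key, hour_of_day, value). Hypothetical layout.
Observation = Tuple[str, int, Any]


def learn_patterns(history: List[Observation],
                   min_support: int = 3) -> Dict[str, Dict[int, Any]]:
    """Return, per data entry, the hours for which its value is predictable."""
    seen: Dict[Tuple[str, int], List[Any]] = defaultdict(list)
    for key, hour, value in history:
        seen[(key, hour)].append(value)

    patterns: Dict[str, Dict[int, Any]] = defaultdict(dict)
    for (key, hour), values in seen.items():
        consistent = all(v == values[0] for v in values)
        if len(values) >= min_support and consistent:
            patterns[key][hour] = values[0]
    return dict(patterns)


history = (
    [("cell7/mode", 2, "sleep")] * 3
    + [("cell7/mode", 14, "active"),
       ("cell7/mode", 14, "boost"),
       ("cell7/mode", 14, "active")]
)
learned = learn_patterns(history)
```

Here the entry is found to be predictable at hour 2 but not at hour 14, where its historical values disagree, so updates at hour 14 would continue to be forwarded as unpredictable data.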

As shown in Figure 1, exemplary methods and apparatus include a snapshot handler 200 and a data entry analysis module 300. The data entry analysis module 300 is configured to receive data indicative of an update (replication data) to one or more data entries in the master database 102. The pattern analyzer 314 may then determine the update that has occurred and determine whether the update follows any pre-existing predictive pattern, in which case that predictive pattern is determined to be valid for the update. If there is no valid predictive pattern the pattern analyzer 314 may control the transmitter 302 to send the replication data out to the pattern learning module 216 of the snapshot handler 200. The snapshot handler 200 is configured to manage requests for snapshots coming from client applications and also scheduled snapshots, and to provide the snapshot in the correct format to a requester. The requested data may involve a full snapshot or a partial snapshot - the term “snapshot” is used herein to encompass either a full snapshot or a partial snapshot.

The snapshot handler 200 comprises the pattern learning module 216 and the snapshot generator 214. The pattern learning module 216 is configured to detect, update and/or delete predictive pattern based on its validity. The pattern learning module 216 may also be configured to store the predictive pattern (e.g. in the pattern database 110), manage the predictive pattern, and keep the update data that is received from the data entry analysis module 300. The snapshot generator 214 is configured to receive requests from any application requiring a new snapshot and provide the information in the requested format. The snapshot generator 214 is also configured to store the data entries that it predicts based on the predictive pattern, joining such predicted data entries 216 with the data entries 218 that remain unchanged since the last update data was received and also with the data entries 220 that do not follow any pattern. The snapshot generator 214 may also manage communication with the pattern learning module 216 and/or the pattern database 110 to obtain recent updates in the existing predictive pattern and with the data entry analysis module 300 to obtain update data that does not follow any predictive pattern. All these communications may be made over a network.

The snapshot generator 214 can determine a consistent snapshot by applying the predictive pattern available given a particular parameter, such as the time the snapshot is requested. This may be done without synchronizing with the pattern learning module 216. Some data entries have an associated predictive pattern that represents a time when the data entry is likely to change, or an expected value for the data entry at a given time. Following such a predictive pattern, the associated data entries are predicted and used for providing the snapshot, making it unnecessary to update those data entries in a snapshot based on, for example, the replication data.
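As a minimal sketch of how an expected value might be derived from a time-based predictive pattern at the moment a snapshot is requested, consider the following Python fragment. The `TimeWindowPattern` class, the `predict_value` function, the entry key format and the time windows are all hypothetical and chosen only for illustration; the disclosure does not prescribe any particular representation of a predictive pattern.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Iterable, Optional


@dataclass(frozen=True)
class TimeWindowPattern:
    """Hypothetical pattern: the entry is expected to hold `expected_value`
    between `start` (inclusive) and `end` (exclusive) on any given day."""
    entry_key: str
    start: time
    end: time
    expected_value: str


def predict_value(patterns: Iterable[TimeWindowPattern],
                  entry_key: str,
                  requested_at: datetime) -> Optional[str]:
    """Return the expected value for `entry_key` at `requested_at`,
    or None when no pattern covers that time."""
    now = requested_at.time()
    for pattern in patterns:
        if pattern.entry_key == entry_key and pattern.start <= now < pattern.end:
            return pattern.expected_value
    return None


# Illustrative (invented) patterns for the location of one subscriber.
patterns = [
    TimeWindowPattern("subscriber-3/location", time(9, 0), time(17, 0), "L2"),
    TimeWindowPattern("subscriber-3/location", time(17, 0), time(23, 0), "L3"),
]

daytime = predict_value(patterns, "subscriber-3/location", datetime(2018, 2, 5, 10, 30))
overnight = predict_value(patterns, "subscriber-3/location", datetime(2018, 2, 5, 3, 0))
```

In this sketch a request at 10:30 yields the predicted location `L2` without consulting the replication data, whereas a request at 03:00 falls outside every window and returns `None`, in which case the data entry would have to be obtained by other means.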

The functionality of the snapshot handler 200 and the data entry analysis module 300 is described in more detail below. The data entry analysis module 300 may be embodied in a computer system comprising known elements, e.g. processor, memory, persistent storage, and network interface, configured to:

• Initiate and terminate a network connection for replication data flows from the master database 102 towards the slave 104, thereby receiving the replication data;

• Search the replication data looking into the updating messages;

• Decode replication data received from the master 102, determine further replication messages and then transmit them to the slave 104;

• Extract from the replication data information about what data entries have been updated and what new values they take (if appropriate) from the replication messages received from the master 102; and/or

• Examine the received replication data and check whether it follows pre-existing predictive pattern or not:

- If the received replication data follows existing predictive pattern, the pattern analyzer 314 may be configured to increment a confidence level in the predictive pattern. In one example, this may comprise incrementing the number of matches of received replication data for the predictive pattern. The pattern analyzer 314 may then control the transmitter 302 to transmit data indicating that valid predictive pattern exists, which may include transmitting the increased confidence level, towards the snapshot handler 200. In exemplary arrangements, a predictive pattern may be used only if it has a confidence level above a threshold value.

- If the received data does not follow any existing predictive pattern, the pattern analyzer 314 may then control the transmitter 302 to transmit the replication data (or data describing the update) to the snapshot handler 200 for processing by the pattern learning module 216.

In order to determine whether the received replication data matches existing predictive pattern, the receiver 304 may be configured to receive predictive pattern or updates to existing predictive pattern from the pattern database 110 or the snapshot handler.

The snapshot handler 200 may be embodied in a computer system comprising known elements, e.g. processor, memory, persistent storage, and network interface, configured to:

• Initiate and terminate a network connection towards the data entry analysis module 300, thereby receiving the replication data or an identifier and the confidence level related to the predictive pattern;

• Store data unrelated to any existing predictive pattern, which may be used by the pattern learning module 216 for determining new predictive pattern or updating existing predictive pattern;

• Store the confidence level relating to existing predictive pattern;

• Store data entries for which the update data has been received and for which there is no valid existing predictive pattern.

• Store the data that follow existing predictive pattern and associate to them the related predictive pattern that predict when such group of data is likely to change (or in other words, when such data is not guaranteed to be still valid due to the knowledge acquired whilst learning and updating the patterns);

• Store and update the predictive pattern that will allow the prediction of the data entries each of them describes;

• Receive a request to generate a partial or complete snapshot, combining data entries that have been stored and are labelled as unpredictable with data entries that are valid according to stored predictive pattern at the time the snapshot is requested and also with data entries generated by prediction based on the predictive pattern; and/or

• Release the snapshot in the requested format.

Figure 4 shows a signalling diagram of an exemplary method for determining valid predictive pattern that describes how and when data entries within a database may change.

• A client 108 performs 400 an update operation, causing one or more data entries within a master database 102 to be updated. The one or more updates may comprise an update of the value of one or more data entries, and may also comprise changes to the schema of the master database 102.

• After performing the one or more updates to the data entries comprised within master database 102, the master database 102 transmits 402 replication data identifying the updates to the data entries applied to the master node 102. In known systems, this data is used to replicate the master 102 in the slave 104.

• The replication data 106 is received by the receiver 304 of the data entry analysis module 300 and transmitted 404 to a slave node 104.

• The pattern analyzer 314 of the data entry analysis module 300 analyzes the received replication data to determine 406 whether a valid predictive pattern exists for the one or more data entries, the predictive pattern indicating an expected value for one or more parameters relating to the one or more data entries.

• In the case of step 406, the pattern analyzer 314 determines that a valid predictive pattern exists for the updated data entry. That is, the update to the data entry follows a prediction that would have been made based on the predictive pattern. Therefore, the pattern analyzer 314 updates the confidence level, e.g. by increasing a matching statistic, and controls the transmitter 302 to transmit 408 the updated confidence level to the snapshot handler 200, optionally along with an identifier of the predictive pattern. The received replication data need not be forwarded to the snapshot handler 200.

• The pattern learning module 216 of the snapshot handler 200 may store the confidence level for the predictive pattern relating to the data entry in the pattern database 110, such that it can be associated therewith.

• Sometime later, the client 108 may perform 410 a further update to one or more data entries in the master database 102.

• After performing the one or more updates to the data entries comprised within master database 102, the master database 102 transmits 412 towards the slave database 104 replication data identifying the updates to the data entries applied in the master node 102.

• The replication data is received by the receiver 304 of the data entry analysis module 300 and is transmitted 414 to the slave database 104.

• The pattern analyzer 314 of the data entry analysis module 300 analyzes the received replication data to determine 416 whether a valid predictive pattern exists for the one or more data entries, the predictive pattern indicating an expected value for one or more parameters relating to the one or more data entries.

• In the case of step 416, the pattern analyzer 314 determines that no predictive pattern exists for the updated data entry. That is, the update to the data entry does not follow a prediction that would have been made based on any predictive pattern or no predictive pattern exists that relates to that data entry.

• Therefore, the pattern analyzer 314 controls the transmitter 302 to transmit 418 towards the snapshot handler 200 information indicative of the replication data.

• The pattern learning module 216 of the snapshot handler 200 analyzes the information received from the data entry analysis module 300 to determine 420 whether valid predictive pattern should be generated. The determination of valid predictive pattern may comprise determining a new predictive pattern relating to the data entry and/or updating an existing, but invalid, predictive pattern relating to the data entry. The determination of whether a valid predictive pattern can be generated may be based on a historical record of values of one or more parameters for the data entry. For example, at a particular time or when a user of a mobile device is in a particular location, other parameters may be determined. In a further example, a particular operation may be predicted to follow another operation or sequence of operations.

• If a predictive pattern is generated, the snapshot handler 200 may store the predictive pattern for use by the data entry analysis module 300. The snapshot handler 200 may therefore transmit the predictive pattern to the pattern database 110 and/or may transmit 422 the valid predictive pattern to the data entry analysis module 300.

• If no valid predictive pattern can be determined, the pattern learning module 216 may label the update to the data entry as unpredictable data 220 and store it as such. The unpredictable data entries 220 may be combined with predicted data entries when generating a snapshot.
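The decision taken by the pattern analyzer 314 at steps 406 and 416 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation: the `Pattern` and `Update` classes, the `analyse` function, the location-to-device rule and the message tuples are hypothetical names introduced here, and the confidence level is modelled simply as a count of matches. Forwarding of the replication data to the slave node 104 (steps 404 and 414) happens in both cases and is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Pattern:
    pattern_id: str
    subscriber: int
    # Hypothetical rule: for this subscriber, a location implies a device.
    device_by_location: Dict[str, str]
    matches: int = 0  # matching statistic used here as the confidence level


@dataclass(frozen=True)
class Update:
    subscriber: int
    location: str
    device: str


def analyse(update: Update, patterns: Dict[int, Pattern]) -> Tuple[str, object]:
    """Return the message that would be sent toward the snapshot handler."""
    pattern = patterns.get(update.subscriber)
    if (pattern is not None
            and pattern.device_by_location.get(update.location) == update.device):
        # The update follows the prediction: raise the confidence level and
        # send only the pattern identifier and the new confidence level.
        pattern.matches += 1
        return ("pattern", (pattern.pattern_id, pattern.matches))
    # No valid pattern: send the replication data itself for pattern learning.
    return ("replication", update)


patterns = {3: Pattern("p-3", 3, {"L2": "D1", "L3": "D2"})}
hit = analyse(Update(3, "L2", "D1"), patterns)
miss = analyse(Update(3, "L2", "D3"), patterns)
unknown = analyse(Update(1, "L1", "D1"), patterns)
```

Here `hit` carries only the identifier `p-3` and a confidence level of 1, while `miss` and `unknown` carry the full update, mirroring transmissions 408 and 418 respectively.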

Figure 5 shows a signalling diagram of an exemplary method for generating a snapshot of a database 102 based on predictive pattern.

• A client 108, which may or may not be the same client 108 of Figures 1 and 4, transmits 500 a request to generate a new snapshot, which is received by the snapshot handler 200.

• The snapshot generator 214 is configured to generate a snapshot according to the request. This may comprise determining 502 whether any currently valid predictive pattern exists that relates to one or more data entries in the master database 102 and predicting one or more data entries of the master database 102 based on the predictive pattern.

• In exemplary methods and apparatus, the snapshot generator 214 may control the transmitter 202 to transmit 504 to the data entry analysis module 300 a request for current values of data entries relating to any unpredictable data 220. In other exemplary methods and apparatus, the unpredictable data 220 may have been stored after it was received from the data entry analysis module 300, as explained above.

• The data entry analysis module 300 transmits 506 the request to the master database 102 and receives 508 current data entries related to unpredictable data in response, which it then transmits 510 to the snapshot handler 200.

• The data entries relating to the unpredictable data 220 are received by the snapshot handler 200 and combined with the previously predicted data entries and the data entries that remain unchanged and the snapshot is generated 512.

• The generated snapshot is transmitted 514 to the client 108.
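The combination performed at step 512 can be sketched as a merge of three groups of data entries. The function `generate_snapshot`, the key names and the sample values below are hypothetical, and the choice to let received unpredictable values take precedence over predictions is an assumption of this sketch, not something the description specifies.

```python
from datetime import datetime
from typing import Callable, Dict

Snapshot = Dict[str, str]


def generate_snapshot(requested_at: datetime,
                      unchanged: Snapshot,
                      predictors: Dict[str, Callable[[datetime], str]],
                      unpredictable: Snapshot) -> Snapshot:
    """Combine unchanged, predicted and unpredictable data entries."""
    snapshot = dict(unchanged)               # entries unchanged since the last update
    for key, predictor in predictors.items():
        snapshot[key] = predictor(requested_at)  # entries derived from patterns
    snapshot.update(unpredictable)           # entries that follow no pattern
    return snapshot


# Invented sample data for three subscribers.
snapshot = generate_snapshot(
    datetime(2018, 2, 5, 10, 30),
    unchanged={"subscriber-2": "L1/D1"},
    predictors={"subscriber-3": lambda t: "L2/D1" if t.hour < 17 else "L3/D2"},
    unpredictable={"subscriber-1": "L3/D3"},
)
```

The resulting dictionary contains one entry per subscriber, and only the unpredictable entry needed data obtained from the replication flow or from the master database 102.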

As shown in Figure 6, the methods and apparatus disclosed herein may be deployed in a distributed database, although they are not limited to such. In the example of Figure 6, the data entry analysis module 300 and the snapshot handler 200 are connected by a network and can be placed in separate nodes or in the same node. If there is limited availability of resources in a master database 102 or slave database 104, either or both of the data entry analysis module 300 and the snapshot handler 200 can be placed in other nodes; otherwise, one or even both can be located in the same node as the master database 102 or slave database 104. The choice will vary depending on the case, for example on the frequency at which snapshots are requested, the data update rate and the load of the master 102 and slave 104.

There now follows an example of the operation of methods and apparatus disclosed herein. The example database has only one table with the following records. For clarity, a pseudocode is used to represent the data structure.

The example presents a case that records the changes in location and/or the kind of device used for accessing a telecommunications network. Whenever there is a change in any of the values of these data entries, it is issued towards the master database 102 as an update operation providing the new location and/or the new device. The update operation does not necessarily change the values of the data entries: it may merely refresh the current subscriber location and device with the same data as in the previous update operation (no change in location and/or device). Under normal conditions, these updates, regardless of whether they imply a real change in the data, are replicated towards the corresponding slave(s) 104 and will be taken as part of the snapshot if such a snapshot is requested at that time.

The master database 102 content at a given time is shown below. For simplicity, the example considers only 3 possible locations and 3 possible types of devices. They are represented as L1, L2 and L3 for each possible location, and as D1, D2 and D3 for each possible device. The subscriber identity is unique for each subscriber and it does not change unless it is deleted from the database 102.

This is also what a snapshot of the database looks like at the current time.

The analysis on how the data entries of the database 102 evolve through time (or based on another parameter, such as location) may allow the determination of a predictive pattern identifying an expected value for one or more parameters relating to the one or more data entries. It may also allow the determination of groups of subscribers that follow similar behavior over similar time periods. Following one or more update operations, the data entries for the database are as follows:

In known systems, all these changes are captured in the database 102 and replication data is transmitted to the slave 104. If a new snapshot is requested, all changes since the last snapshot was created are forwarded to the snapshot handler 200 and analyzed to determine whether they should be included in the new snapshot (updating the previous one) or not. According to exemplary arrangements disclosed herein, when the changes are replicated towards the slave 104, they are first captured and analyzed by the data entry analysis module 300. Due to the lack of any existing predictive pattern (e.g. it is a clean start of the system and no predictive pattern exists), replication data corresponding to all subscribers will be propagated to the snapshot handler 200. The pattern learning module 216 of the snapshot handler 200 undertakes a learning phase in which it records the updates for subscribers 1 and 3 as changes in some of the values of the data entries and the updates for the rest of the subscribers as unchanged data entries.

After some time, there are some further updates to data entries in the database 102. Consider as an illustrative example the following changes:

When the corresponding replication data is transmitted to the snapshot handler 200, a possible conclusion from the pattern learning module 216 would be that subscriber 3 uses a different device depending on where he/she is located, and it is also possible to derive for how long such state remains unchanged. With this information in place, it is possible to generate a predictive pattern that says that subscriber 3 when it is located in L2 uses D1 and in L3 uses D2. By means of this predictive pattern, only the change in location shall be replicated to know with high certainty what device the subscriber is using according to his/her location. Another pattern at this point may predict that subscribers 2 and 5 remain unchanged and their associated data will not be replicated until a change in any of the values is made.
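One simple way the pattern learning module 216 might arrive at the rule for subscriber 3 is to count, per location, which devices have been observed in the historical record and keep only locations that always map to the same device. The function name `learn_device_by_location`, the `min_matches` threshold and the sample history are assumptions made for this sketch; the description leaves the learning technique open.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def learn_device_by_location(history: Iterable[Tuple[str, str]],
                             min_matches: int = 2) -> Dict[str, str]:
    """Derive a location -> device rule from (location, device) observations.

    A location is included only if it was always seen with the same device
    and was observed at least `min_matches` times.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for location, device in history:
        counts[location][device] += 1

    rule: Dict[str, str] = {}
    for location, devices in counts.items():
        if len(devices) == 1:
            device, seen = next(iter(devices.items()))
            if seen >= min_matches:
                rule[location] = device
    return rule


# Invented historical record for subscriber 3.
history_subscriber_3 = [("L2", "D1"), ("L3", "D2"), ("L2", "D1"), ("L3", "D2")]
learned = learn_device_by_location(history_subscriber_3)

# A location seen with two different devices yields no rule.
inconsistent = learn_device_by_location([("L1", "D1"), ("L1", "D2")])
```

With the sample history, `learned` maps L2 to D1 and L3 to D2, which is the predictive pattern described above; once such a rule exists, only the location needs to be replicated to infer the device.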

The pattern learning module 216 may store the predictive pattern. In case a snapshot is requested, the previously learned patterns may be used by the snapshot generator 214 for predicting the data entries they are associated with.

In exemplary arrangements, a verification process may be run by the pattern learning module 216, which checks the validity of the existing predictive pattern, taking a full refresh of the data entries of the database 102. This may be used in case accuracy of the data is required to be very high.
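A verification pass of this kind could, for instance, compare the values predicted from the stored patterns with the values obtained from a full refresh. The helper `verify_predictions` and the sample data are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict


def verify_predictions(predicted: Dict[str, str],
                       refreshed: Dict[str, str]) -> Dict[str, bool]:
    """Map each predicted entry to True if it matches the refreshed value."""
    return {key: refreshed.get(key) == value for key, value in predicted.items()}


# Invented data: predicted device per subscriber versus a full refresh.
predicted = {"subscriber-3": "D1", "subscriber-5": "D2"}
refreshed = {"subscriber-3": "D1", "subscriber-5": "D3"}

validity = verify_predictions(predicted, refreshed)
invalid_entries = [key for key, ok in validity.items() if not ok]
```

Entries listed in `invalid_entries` would indicate patterns that are candidates for updating or deletion by the pattern learning module 216.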

A computer program may be configured to provide any of the above described methods. The computer program may be provided on a computer readable medium. The computer program may be a computer program product. The product may comprise a non-transitory computer usable storage medium. The computer program product may have computer-readable program code embodied in the medium configured to perform the method. The computer program product may be configured to cause at least one processor to perform some or all of the method.

Various methods and apparatus are described herein with reference to block diagrams or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits. These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).

Computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the functions/acts specified in the block diagrams and/or flowchart block or blocks.

A tangible, non-transitory computer-readable medium may include an electronic, magnetic, optical, electromagnetic, or semiconductor data storage system, apparatus, or device. More specific examples of the computer-readable medium would include the following: a portable computer diskette, a random access memory (RAM) circuit, a read-only memory (ROM) circuit, an erasable programmable read-only memory (EPROM or Flash memory) circuit, a portable compact disc read-only memory (CD-ROM), and a portable digital video disc read-only memory (DVD/Blu-ray).

The computer program instructions may also be loaded onto a computer and/or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer and/or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. Accordingly, the invention may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.) that runs on a processor, which may collectively be referred to as "circuitry," "a module" or variants thereof. It should also be noted that in some alternate implementations, the functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Moreover, the functionality of a given block of the flowcharts and/or block diagrams may be separated into multiple blocks and/or the functionality of two or more blocks of the flowcharts and/or block diagrams may be at least partially integrated. Finally, other blocks may be added/inserted between the blocks that are illustrated. The skilled person will be able to envisage other embodiments without departing from the scope of the appended claims.

Claims

1. A data entry analysis module (300) for analysing replication flows (106) from a master node (102) to a slave node (104) of a distributed database (100) and comprising:

a receiver (304) configured to receive replication flows comprising one or more operations that update one or more data entries in the distributed database (100) ;

a pattern analyser (314) configured to determine whether a predictive pattern exists for each of the one or more operations, wherein the predictive pattern indicates an expected value for one or more parameters of each of the one or more data entries;

and wherein the pattern analyser (314) is further configured to control a transmitter (302) to transmit, toward a snapshot handler (200), information relating to the operations that comprises:

if a predictive pattern exists for an operation, information relating to the predictive pattern;

if a predictive pattern does not exist for an operation, the received replication flow comprising the operation, which is unrelated to any predictive pattern.

2. The data entry analysis module (300) of claim 1, wherein the receiver (304) is further configured to receive the predictive patterns from any one of a pattern database (110) and the snapshot handler (200).

3. The data entry analysis module (300) of any of claims 1 or 2, wherein the information relating to the predictive pattern comprises a confidence level for the predictive pattern.

4. The data entry analysis module (300) of claim 3, wherein the confidence level for the predictive pattern is increased when it is determined that the predictive pattern exists for an operation.

5. The data entry analysis module (300) of any of claims 1 to 4, wherein the one or more parameters comprise any one of a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof.

6. The data entry analysis module (300) of any of claims 1 to 5, wherein the information relating to the predictive pattern comprises an identifier of the predictive pattern.

7. A snapshot handler (200) for generating a snapshot of a distributed database (100), the snapshot comprising a plurality of data entries representative of corresponding data entries in the distributed database (100) at a time that the snapshot is taken, the snapshot handler comprising:

a receiver (204) configured to receive, from a data entry analysis module (300), information relating to operations that update one or more data entries in the distributed database (100), wherein the received information comprises:

if a predictive pattern exists for an operation, information relating to the predictive pattern, wherein the predictive pattern indicates an expected value for one or more parameters of each of the one or more data entries; if a predictive pattern does not exist for an operation, a replication flow, from a master node (102) to a slave node (104) of the distributed database (100), the replication flow comprising the operation that updates the one or more data entries in the distributed database and is unrelated to any predictive pattern;

a snapshot generator (214) configured to:

identify unchanged information (218) for the operations; and generate a snapshot based on the predictive patterns for corresponding operations, the information unrelated to any predictive pattern for the corresponding operations and the unchanged information.

8. The snapshot handler (200) of claim 7 further comprising a pattern learning module (216) configured to determine a new predictive pattern for the replication flow comprising the one or more operations that update the one or more data entries, based on a historical record of the one or more data entries and the one or more parameters.

9. The snapshot handler (200) of claim 8, wherein the pattern learning module (216) is configured to control a transmitter (202) to transmit the new predictive pattern toward any one of a pattern database (110) and the data entry analysis module (300).

10. The snapshot handler (200) of any of claims 7 to 9, wherein the snapshot generator (214) is further configured to store the information relating to the predictive patterns for the corresponding operations as predictable information (216), store the information unrelated to any predictive pattern for the corresponding operations as unpredictable information (220), and store the unchanged information (218) for the operations.

11. The snapshot handler (200) of any of claims 7 to 10, wherein the receiver (204) is further configured to receive a request for a snapshot from a client node (108) and, responsive to this request, the snapshot generator (214) is configured to generate the snapshot and control the transmitter (202) to transmit the snapshot toward the client node (108).

12. The snapshot handler (200) of any of claims 7 to 11, wherein the receiver (204) is further configured to receive the predictive patterns from any one of a pattern database (110) and the data entry analysis module (300).

13. The snapshot handler (200) of any of claims 7 to 12, wherein the information relating to the predictive pattern comprises a confidence level for the predictive pattern.

14. The snapshot handler (200) of any of claims 7 to 13, wherein the information relating to the predictive pattern comprises an identifier of the predictive pattern.

15. The snapshot handler (200) of any of claims 7 to 14, wherein the one or more parameters comprise any one of a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof.

16. A method for generating a snapshot of a distributed database (100), the snapshot comprising a plurality of data entries representative of corresponding data entries in the distributed database (100), the method carried out at a data entry analysis module (300), which is in charge of analysing replication flows (106) from a master node (102) to a slave node (104) of the distributed database, and comprising:

receiving replication flows that comprise operations to update one or more data entries in the distributed database (100);

determining whether a predictive pattern exists for each operation, wherein the predictive pattern indicates an expected value for one or more parameters of each of the one or more data entries; and

transmitting, toward a snapshot handler (200), information relating to the operations that comprises:

if a predictive pattern does not exist for an operation, the received replication flow that comprises the operation, which is unrelated to any predictive pattern.

17. The method of claim 16, wherein it further comprises receiving the predictive patterns from any one of a pattern database (110) and the snapshot handler (200).

18. The method of any of claims 16 or 17, wherein the information relating to the predictive pattern comprises a confidence level for the predictive pattern.

19. The method of claim 18, wherein the confidence level for the predictive pattern is increased when it is determined that the predictive pattern exists for an operation.

20. The method of any of claims 16 to 19, wherein the one or more parameters comprise any one of a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof.

21. The method of any of claims 16 to 20, wherein the information relating to the predictive pattern comprises an identifier of the predictive pattern.

22. A method for generating a snapshot of a distributed database (100), the snapshot comprising a plurality of data entries representative of corresponding data entries in the distributed database (100), the method carried out at a snapshot handler (200), which is in charge of generating the snapshot, and comprising:

receiving, from a data entry analysis module (300), information relating to operations that update one or more data entries in the distributed database (100), wherein the received information comprises:

if a predictive pattern exists for an operation, information relating to the predictive pattern, wherein the predictive pattern indicates an expected value for one or more parameters of each of the one or more data entries; if a predictive pattern does not exist for an operation, a replication flow, from a master node (102) to a slave node (104) of the distributed database (100), wherein the replication flow comprises the operation that updates the one or more data entries and is unrelated to any predictive pattern;

identifying unchanged information (218) for the operations; and generating a snapshot based on the predictive patterns for the corresponding operations, the information unrelated to any predictive pattern for the corresponding operations and the unchanged information.

23. The method of claim 22, wherein it further comprises determining a new predictive pattern for the replication flow, wherein the replication flow comprises the operation that updates the one or more data entries, based on a historical record of the one or more data entries and the one or more parameters.

24. The method of claim 23, wherein it further comprises transmitting the new predictive pattern toward any one of a pattern database (110) and the data entry analysis module (300).

25. The method of any of claims 22 to 24, wherein it further comprises storing the information relating to the predictive patterns for the corresponding operations as predictable information (216), storing the information unrelated to any predictive pattern for the corresponding operations as unpredictable information (220), and storing the unchanged information (218) for the operations.

26. The method of any of claims 22 to 25, wherein it further comprises receiving a request for a snapshot from a client node (108) and, responsive to this request, generating the snapshot and transmitting the snapshot toward the client node (108).

27. The method of any of claims 22 to 26, wherein the predictive patterns are received from any one of a pattern database (110) and the data entry analysis module (300).

28. The method of any of claims 22 to 27, wherein the information relating to the predictive pattern comprises a confidence level for the predictive pattern.

29. The method of any of claims 22 to 28, wherein the information relating to the predictive pattern comprises an identifier of the predictive pattern.

30. The method of any of claims 22 to 29, wherein the one or more parameters comprise any one of a time, a location, a device-related parameter, a user-related parameter, a service-related parameter, a subscription-related parameter, an access-related parameter and combinations thereof.

31. A computer program comprising instructions which, when executed on at least one processor, cause the at least one processor to carry out the method according to any of claims 16 to 30.

32. A carrier containing the computer program of claim 31, wherein the carrier is one of an electronic signal, optical signal, radio signal, or non-transitory computer readable storage medium.