WO2021010896A1 - Method and system for distributed data management - Google Patents

Method and system for distributed data management

Info

Publication number
WO2021010896A1
Authority
WO
WIPO (PCT)
Prior art keywords
randomized
tensor
data
distributed computing
computing nodes
Prior art date
Application number
PCT/SG2020/050404
Other languages
French (fr)
Inventor
Jenn Bing ONG
Wee Keong NG
Original Assignee
Nanyang Technological University
Priority date
Filing date
Publication date
Application filed by Nanyang Technological University
Publication of WO2021010896A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L 9/0816 Key establishment, i.e. cryptographic processes or cryptographic protocols whereby a shared secret becomes available to two or more parties, for subsequent use
    • H04L 9/085 Secret sharing or secret splitting, e.g. threshold schemes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 10/00 Quantum computing, i.e. information processing based on quantum-mechanical phenomena
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Y INFORMATION AND COMMUNICATION TECHNOLOGY SPECIALLY ADAPTED FOR THE INTERNET OF THINGS [IoT]
    • G16Y 40/00 IoT characterised by the purpose of the information processing
    • G16Y 40/50 Safety; Security of things, users, data or systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L 9/00
    • H04L 2209/42 Anonymization, e.g. involving pseudonyms

Definitions

  • the present invention generally relates to a method of distributed data management and a system thereof, and more particularly, in relation to big data or big and complex data.
  • AI: cutting-edge artificial intelligence
  • IoT: Internet-of-Things
  • Big data requires storage and processing power beyond traditional computing resources, therefore enterprises and government bodies undergoing digital transformation often need to extend their computing capability to the cloud and mobile environments.
  • big data may contain sensitive information that can be exploited once placed in the public cloud resources.
  • videos taken by a network of surveillance camera may contain personal information such as individuals’ locations and preferences.
  • the increase in attack surfaces from digital transformation calls for simpler data-security solutions such as encryption and access control.
  • classical cybersecurity solutions such as end-point security, network security, and digital vault are not scalable and not cost-effective to protect data privacy and security.
  • a method of distributed data management using at least one processor comprising: decomposing, at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmitting, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and storing, at a memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block.
  • a system for distributed data management comprising: a memory; and at least one processor communicatively coupled to the memory and configured to: decompose, at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and store, at the memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks identity information and location information relating to the randomized tensor block.
  • a network system for distributed data management comprising: a plurality of distributed servers comprising a plurality of distributed computing nodes, respectively, each distributed server comprising: a memory; and at least one processor communicatively coupled to the memory and comprising the corresponding distributed computing node; and a system for distributed data management according to the second aspect of the present invention.
  • a computer program product embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of distributed data management according to the first aspect of the present invention.
  • FIG. 1 depicts a schematic flow diagram of a method of distributed data management using at least one processor according to various embodiments of the present invention
  • FIG. 2 depicts a schematic block diagram of a system for distributed data management according to various embodiments of the present invention, such as corresponding to the method of distributed data management as described with reference to FIG.1;
  • FIG. 3 depicts a schematic block diagram of an exemplary computer system which the system for distributed data management according to various embodiments of the present invention may be embodied as;
  • FIG. 4 depicts a schematic block diagram of a network system for distributed data management, according to various embodiments of the present invention
  • FIGs. 5A and 5B depict tables and diagrams systematically comparing the distributed data management method according to various example embodiments of the present invention against existing data-security solutions based on technical parameters;
  • FIGs. 6A, 6B and 7 depict example conventional tensor network decomposition techniques, along with their basic tensor network formats;
  • FIG. 8 shows an example of distributed/dispersed computation that can be performed with distributed tensor network operations
  • FIG. 9 depicts an example distributed data management method in relation to big data at-rest and in-transit security, according to various example embodiments of the present invention
  • FIG. 10 depicts an example distributed data management method in relation to secure big data sharing, according to various example embodiments of the present invention;
  • FIG. 11 depicts an example distributed data management method in relation to privacy-preserving big data computation, according to various example embodiments of the present invention;
  • FIG. 12 depicts an example distributed data management method in relation to secure multi-party computation, according to various example embodiments of the present invention;
  • FIG. 13 depicts an example randomization algorithm (randomized TT-SVD algorithm or Algorithm 1), according to various example embodiments of the present invention
  • FIG. 14 depicts a graphical representation of the example randomized TT-SVD algorithm (Algorithm 1) shown in FIG. 13, according to various example embodiments of the present invention;
  • FIG. 15 depicts an example randomization algorithm (randomized TT-rounding algorithm or Algorithm 2), according to various example embodiments of the present invention;
  • FIG. 16 depicts an example randomization algorithm (randomized TT incremental update algorithm or Algorithm 3), according to various example embodiments of the present invention
  • FIGs. 17 to 25 show example randomized tensor network decompositions of various types of data, e.g., image, audio, video, sensors, graph, and textual data, according to various example embodiments of the present invention
  • FIG. 26 shows the image distortion resulting from adding noise into a randomly-selected tensor block of different tensor network decompositions, according to various example embodiments of the present invention;
  • FIG. 27 shows the normalized mutual information (NMI) between tensor blocks that belong to a particular image (top row), two different images (bottom row), and random noise (“rand”), according to various example embodiments of the present invention;
  • FIG. 28 depicts a table to benchmark the tensor network decomposition/reconstruction efficiency (in milliseconds) for different datasets, compression ratio using tensor network compression and quantization, and the classification accuracy of a convolutional neural network using the compressed data;
  • FIG. 29 shows an example architecture of an example secret sharing method based on distributed tensor network, according to various example embodiments of the present invention.
  • Various embodiments of the present invention provide a method of distributed data management and a system thereof, and more particularly, in relation to big data, that seek to overcome, or at least ameliorate, one or more of the deficiencies in conventional method and system for distributed data management, such as but not limited to, improving or preserving data privacy and security in an effective and/or efficient manner.
  • FIG.1 depicts a schematic flow diagram of a method 100 of distributed data management using at least one processor according to various embodiments of the present invention.
  • the method 100 comprises: decomposing (at 102), at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmitting (at 104), at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and storing (at 106), at a memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to (or associated with) the randomized tensor block.
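  • As an illustrative sketch only (not the patented implementation), the following Python fragment mirrors the three steps of method 100 (decomposing at 102, transmitting at 104, storing metadata at 106); the helper callables tt_decompose_randomized and send_block, and the metadata field names, are hypothetical placeholders.

```python
# Illustrative sketch of the decompose / transmit / store-metadata flow.
# tt_decompose_randomized and send_block are hypothetical callables supplied by the caller.
import uuid
import numpy as np

def manage_distributed(data, node_addresses, tt_decompose_randomized, send_block):
    """Decompose `data` into randomized tensor blocks, scatter them, and keep metadata."""
    blocks = tt_decompose_randomized(data)                 # step 102: randomized decomposition
    chosen = np.random.permutation(len(node_addresses))[:len(blocks)]
    metadata = {"original_shape": data.shape, "blocks": []}
    for block, node_idx in zip(blocks, chosen):
        block_id = uuid.uuid4().hex                        # anonymized identity information
        address = node_addresses[node_idx]                 # location information
        send_block(address, block_id, block)               # step 104: transmit to a distributed node
        metadata["blocks"].append({"id": block_id, "location": address})
    return metadata                                        # step 106: stored at the source node
```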
  • the above-mentioned data randomly decomposed into the plurality of randomized tensor blocks may be big data or big and complex data.
  • the above-mentioned data may be a data chunk or a data block, such as a data chunk or a data block of the big data or big and complex data.
  • the above-mentioned data may refer to a particular data, such as in relation to a particular record, and may have a particular or unique filename.
  • the particular data may include information where privacy is required or desired to be preserved, such as but not limited to, proprietary information, confidential information or personal information.
  • the source computing node may refer to the computing node considered as or referred to as a source or an origin in relation to the above-mentioned data which is decomposed randomly and in relation to the above- mentioned tensor network decomposition.
  • the above-mentioned data may be related to or associated with an owner of the above-mentioned data (i.e., the data owner).
  • the plurality of distributed computing nodes may refer to the computing nodes which the plurality of randomized tensor blocks are transmitted or distributed to.
  • the plurality of distributed computing nodes may be a plurality of virtual instances (e.g., at the same processor or at different processors), at a plurality of devices (e.g., computing devices), at a plurality of servers (e.g., cloud servers), and so on.
  • decomposing the above- mentioned data randomly into the plurality of randomized tensor blocks based on tensor network decomposition may refer to any randomization (or any randomization technique) in relation to the tensor network decomposition such that the plurality of randomized tensor blocks are obtained.
  • randomization in relation to the tensor network decomposition may include randomized initialization, randomized hyperparameters selection, randomized sampling, randomized updating rate (or learning rate) of each entry in the tensor blocks, randomized mapping algorithms and so on.
  • one or more (e.g., all) of the plurality of randomized tensor blocks may each include a group or set of randomized tensor blocks (or randomized tensor sub-blocks).
  • one or multiple randomized tensor blocks may be generated and grouped or considered as a randomized tensor block as a whole for transmission to one of the plurality of distributed computing nodes.
  • the identity information relating to the corresponding randomized tensor block may be any information (or data) configured for identifying the corresponding randomized tensor block, such as but not limited to, a filename.
  • the location information relating to the corresponding randomized tensor block may be any information (or data) configured for locating the randomized tensor block, such as but not limited to, an address (e.g., an IP address) of the distributed computing node which the randomized tensor block is assigned to be transmitted to and/or is transmitted to.
  • because the above-mentioned data is decomposed randomly into a plurality of randomized tensor blocks and transmitted to a plurality of distributed computing nodes, while the metadata comprising identity information and location information relating to the plurality of randomized tensor blocks is stored at a memory associated with the source computing node, data privacy and security in relation to the above-mentioned data is advantageously preserved or improved in an effective and/or efficient manner.
  • the tensor network decomposition is based on singular value decomposition (SVD), and the above-mentioned data is randomly decomposed into the plurality of randomized tensor blocks based on perturbation, and more particularly, a perturbation vector configured to randomly distribute singular values associated with the SVD in relation to the plurality of randomized tensor blocks.
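  • As an illustration of one way such a perturbation can randomize an SVD-based split, the following numpy sketch draws a random exponent for each singular value and distributes the singular values between the two factors accordingly, so repeated decompositions of the same data yield different factors whose product is unchanged; this specific construction is an assumption for illustration and not necessarily the perturbation vector of the example embodiments.

```python
import numpy as np

def randomized_svd_split(A, rng=None):
    """Split A into two factors whose product is A, with each singular value
    randomly shared between the factors (one possible perturbation)."""
    rng = np.random.default_rng() if rng is None else rng
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    alpha = rng.uniform(0.0, 1.0, size=s.shape)    # random share of each singular value
    left = U * (s ** alpha)                        # scale the columns of U
    right = (s ** (1.0 - alpha))[:, None] * Vt     # scale the rows of Vt
    return left, right                             # left @ right == A (up to round-off)

A = np.random.rand(6, 4)
left1, right1 = randomized_svd_split(A)
left2, right2 = randomized_svd_split(A)
print(np.allclose(left1 @ right1, A), np.allclose(left2 @ right2, A))  # True True
print(np.allclose(left1, left2))                                        # False: factors differ
```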
  • the plurality of randomized tensor blocks are each compressed based on one or more coding techniques or schemes.
  • quantization may be a lossy compression process of mapping input values to a smaller number of representative output values which have a smaller storage footprint (e.g., 1.2345 being mapped to 1.23).
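  • A minimal sketch of such scalar quantization (the step size and the integer codes are illustrative assumptions):

```python
import numpy as np

def uniform_quantize(x, step=0.01):
    """Lossy quantization: map each value to the nearest multiple of `step`
    (e.g., 1.2345 -> 1.23 for step = 0.01), shrinking the storage footprint."""
    codes = np.round(np.asarray(x) / step).astype(np.int32)   # small integer codes to store
    return codes, codes * step                                 # codes and dequantized values

codes, approx = uniform_quantize([1.2345, -0.5678], step=0.01)
print(codes, approx)   # [123 -57] [ 1.23 -0.57]
```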
  • Bit-plane coding compresses each bit plane of the tensor blocks from the most significant bits to the least significant bits. Residual coding compresses the residual after lossy compression of the original data to provide lossless or near-lossless reconstruction accuracy.
  • Transform coding, such as the Fourier or Wavelet transform, converts input data from one kind of representation to another, and the transformed values (or coefficients) are encoded by compression techniques.
  • the one or more coding techniques may also include dictionary-based encoding (e.g., LZW, LZ77), Run Length Encoding (RLE), and Entropy Encoding, such as Huffman or Arithmetic coding, to compress data in a lossless manner.
  • RLE represents a sequence of symbols as runs with two parts: the data value and the count.
  • in dictionary-based encoding, a repeated input pattern may be coded with an index, which is useful if the input sequence contains many repeated patterns.
  • Huffman and Arithmetic coding may be useful when the input source contains a small alphabet size with skewed probabilities, both being variable-length codes such that frequently occurring symbols may be coded with fewer bits than rarely occurring symbols.
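  • A minimal sketch of Run Length Encoding as described above (illustrative only):

```python
def rle_encode(seq):
    """Run Length Encoding: represent a sequence as (value, count) runs."""
    runs = []
    for sym in seq:
        if runs and runs[-1][0] == sym:
            runs[-1][1] += 1                      # extend the current run
        else:
            runs.append([sym, 1])                 # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    return "".join(value * count for value, count in runs)

runs = rle_encode("AAAABBBCCD")
print(runs)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(runs))  # AAAABBBCCD
```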
  • the above-mentioned data is an n-th order tensor and each of the plurality of randomized tensor blocks has an equal or a lower order than the above-mentioned data.
  • the above-mentioned data is big and complex data.
  • the plurality of distributed computing nodes are non-colluding amongst themselves.
  • cloud servers may be non-colluding and therefore each of them cannot recover the original data (corresponding to the above-mentioned data), even though each of them may have the respective randomized tensor block stored thereon.
  • non-colluding distributed computing nodes may be achieved by having each distributed computing node (e.g., cloud instance) owned by a different user within an organization, such as to minimize insider attack.
  • each distributed computing node may be run on a different software stack to minimize the chance that they are all vulnerable to the same exploit available to malware.
  • the plurality of distributed computing nodes each has data security implemented thereat, and each of the plurality of randomized tensor blocks received at the corresponding distributed computing node is subjected to the data security implemented thereat.
  • each distributed computing node may have implemented thereat its own encryption, access control, and/or security mechanisms. Therefore, according to various embodiments of the present invention, distributed trust for data privacy protection is advantageously provided.
  • the plurality of randomized tensor blocks are randomly assigned amongst the plurality of distributed computing nodes for said transmission thereto, respectively.
  • each of the plurality of randomized tensor blocks is randomly assigned to one of the plurality of distributed computing nodes (respectively, i.e., without overlap in assignment) for transmission thereto.
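  • A minimal sketch of such a random, non-overlapping assignment of blocks to nodes (the block identifiers and node addresses are hypothetical):

```python
import random

def assign_blocks(block_ids, node_addresses, seed=None):
    """Randomly assign each randomized tensor block to a distinct distributed
    computing node (no overlap in assignment)."""
    rng = random.Random(seed)
    nodes = rng.sample(node_addresses, k=len(block_ids))   # distinct nodes in random order
    return dict(zip(block_ids, nodes))

print(assign_blocks(["block-1", "block-2", "block-3"],
                    ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"], seed=7))
```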
  • the identity information relating to each randomized tensor block is anonymized and the location information relating to the randomized tensor block corresponds to an address of the distributed computing node which the randomized tensor block is assigned to.
  • the metadata associated with the plurality of randomized tensor blocks stored at the memory associated with the source computing node are encrypted (e.g., encrypted by the source computing node).
  • the method 100 further comprises: transmitting, at the source computing node, a storage request message to first one or more of the plurality of distributed computing nodes (e.g., first group or set of one or more of the plurality of distributed computing nodes) based on the identity information and the location information relating to each of corresponding first one or more of the plurality of randomized tensor blocks for instructing the above-mentioned first one or more of the plurality of distributed computing nodes to store the above-mentioned first one or more of the plurality of randomized tensor blocks received at corresponding one or more memories associated with the above-mentioned first one or more of the plurality of distributed computing nodes, respectively.
  • the first one or more of the plurality of distributed computing nodes are all of the plurality of distributed computing nodes.
  • the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks.
  • the method 100 further comprises: transmitting, at the source computing node, a retrieval request message to each of the plurality of distributed computing nodes based on the identity information and the location information relating to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to the source computing node; receiving, at the source computing node, the plurality of randomized tensor blocks transmitted from the plurality of distributed computing nodes, respectively, in response to the retrieval request message; and generating, at the source computing node, a reconstructed data corresponding to the above-mentioned data (which may be referred to as the original data) based on the plurality of randomized tensor blocks received and the reconstruction information in the metadata associated with the plurality of randomized tensor blocks.
  • the reconstruction information may include reconstruction algorithm configured to produce the above-mentioned reconstructed data corresponding to the original data based on the plurality of randomized tensor blocks (e.g., received from the plurality of distributed computing nodes).
  • the above-mentioned data (or original data) may be a particular data and the metadata may thus correspond to that particular data.
  • an original image may be decomposed into three randomized tensor blocks, and the metadata associated with the three randomized tensor blocks stored at the source computing node may include the identities (e.g., anonymized filenames) and locations (e.g., IP addresses of the three distributed computing nodes which the three randomized tensor blocks were assigned or transmitted to) relating to the three randomized tensor blocks and the reconstruction algorithm associated with the three randomized tensor blocks.
  • a reconstructed image corresponding to the original image may then be produced or reconstructed after retrieving all of the three randomized tensor blocks from the three distributed computing nodes.
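  • As an illustration of the reconstruction step, the following sketch contracts retrieved tensor-train cores back into a full tensor; the core shapes and the assumption that the blocks are tensor-train cores are for illustration only, and the actual reconstruction algorithm is the one recorded in the metadata.

```python
import numpy as np

def tt_reconstruct(cores):
    """Contract retrieved tensor-train cores G_k of shape (r_{k-1}, n_k, r_k)
    back into the full tensor."""
    result = cores[0]                                   # shape (1, n_1, r_1)
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=([-1], [0]))
    return result.squeeze(axis=(0, -1))                 # drop the boundary ranks of size 1

# Toy example: a small "image" of shape (4, 6, 3) held as three retrieved blocks.
g1 = np.random.rand(1, 4, 2)
g2 = np.random.rand(2, 6, 2)
g3 = np.random.rand(2, 3, 1)
print(tt_reconstruct([g1, g2, g3]).shape)               # (4, 6, 3)
```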
  • the method 100 further comprises: transmitting, at the source computing node, a computation request message to second one or more of the plurality of distributed computing nodes (second group or set of one or more of the plurality of distributed computing nodes) based on the identity information and the location information relating to each of corresponding second one or more of the plurality of randomized tensor blocks for instructing the above-mentioned second one or more of the plurality of distributed computing nodes to perform a computation on the above-mentioned second one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with the above-mentioned second one or more of the plurality of distributed computing nodes to obtain one or more computed outputs, respectively.
  • the above-mentioned computation may be one or more of a plurality of multilinear operations, such as but not limited to, addition, multiplication, matrix-by-vector/matrix multiplication and so on
  • the method 100 may further comprise receiving, at the source computing node, the above-mentioned one or more computed outputs from the above-mentioned second one or more of the plurality of distributed computing nodes, respectively, in response to the computation request message.
  • the second one or more of the plurality of distributed computing nodes are all of the plurality of distributed computing nodes.
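  • As an illustration of a multilinear operation computed directly on the distributed blocks, the following sketch scales the represented data by acting on a single tensor-train core only, so no node ever needs the reconstructed data; the tensor-train layout here is an assumption for illustration, not the full set of operations supported by the example embodiments.

```python
import numpy as np
from functools import reduce

def contract(cores):
    """Reconstruct a tensor train into a full tensor (used here only for checking)."""
    full = reduce(lambda acc, core: np.tensordot(acc, core, axes=([-1], [0])), cores)
    return full.squeeze(axis=(0, -1))

# Three randomized cores held by three different distributed computing nodes.
cores = [np.random.rand(1, 4, 3), np.random.rand(3, 5, 2), np.random.rand(2, 6, 1)]

# To scale the represented data by 2.5, only the node holding the first core
# needs to act on its own block; the data is never reconstructed at any node.
scaled = [2.5 * cores[0]] + cores[1:]

print(np.allclose(contract(scaled), 2.5 * contract(cores)))   # True
```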
  • the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks.
  • the method 100 further comprises: transmitting, at the source computing node, a sharing request message to each of the plurality of distributed computing nodes based on the identity information and the location information in relation to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to a second computing node (e.g., a third party computing node); and transmitting, at the source computing node, the metadata associated with the plurality of randomized tensor blocks to the second computing node.
  • the method 100 further comprises: transmitting, at the source computing node, an update request message to third one or more of the plurality of distributed computing nodes (a third group or set of one or more of the plurality of distributed computing nodes) based on the identity information and the location information relating to each of corresponding third one or more of the plurality of randomized tensor blocks for instructing the above-mentioned third one or more of the plurality of distributed computing nodes to perform an update on the above- mentioned third one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with the above-mentioned third one or more of the plurality of distributed computing nodes to obtain a plurality of updated randomized tensor blocks, respectively.
  • the third one or more of the plurality of distributed computing nodes are all of the plurality of distributed computing nodes.
  • the method 100 is not limited to the order of the steps as shown in FIG.1, and the steps may be performed in any order suitable or appropriate for the same or similar outcome.
  • FIG. 2 depicts a schematic block diagram of a system 200 for distributed data management according to various embodiments of the present invention, such as corresponding to the method 100 of distributed data management as described hereinbefore according to various embodiments of the present invention.
  • the system 200 comprises a memory 202, and at least one processor 204 communicatively coupled to the memory 202 and configured to: decompose, at a source computing node of the at least one processor 204, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and store, at the memory 202 associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information in relation to the randomized tensor block.
  • the at least one processor 204 may be configured to perform the required functions or operations through set(s) of instructions (e.g., software modules) executable by the at least one processor 204 to perform the required functions or operations.
  • the system 200 may comprise a tensor network decomposition module (or a tensor network decomposition circuit) 206 configured to perform the above-mentioned decompose, at a source computing node of the at least one processor 204, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; a tensor block transmission module (or a tensor block transmission circuit) 208 configured to perform the above-mentioned transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and a metadata module (or a metadata circuit) 210 configured to perform the above-mentioned store, at the memory 202 associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the corresponding randomized tensor block.
  • the above-mentioned modules are not necessarily separate modules, and two or more modules may be realized by or implemented as one functional module (e.g., a circuit or a software program) as desired or as appropriate without deviating from the scope of the present invention.
  • the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210 may be realized (e.g., compiled together) as one executable software program (e.g., software application or simply referred to as an“app”), which for example may be stored in the memory 202 and executable by the at least one processor 204 to perform the functions/operations as described herein according to various embodiments.
  • the tensor block transmission module 208 may be configured to transmit the plurality of randomized tensor blocks to the plurality of distributed computing nodes, respectively, via a wireless signal transmitter or a transceiver (not shown) of the system 200.
  • the system 200 corresponds to the method 100 as described hereinbefore with reference to FIG. 1, therefore, various functions or operations configured to be performed by the at least one processor 204 (or by the source computing node of the at least one processor 204) may correspond to various steps of the method 100 described hereinbefore according to various embodiments, and thus need not be repeated with respect to the system 200 for clarity and conciseness.
  • various embodiments described herein in context of the methods are analogously valid for the respective systems (e.g., the system 200), and vice versa.
  • the memory 202 may have stored therein the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210, which respectively correspond to various steps of the method 100 as described hereinbefore according to various embodiments, which are executable by the at least one processor 204 to perform the corresponding functions/operations as described herein.
  • a computing system, a controller, a microcontroller or any other system providing a processing capability may be provided according to various embodiments in the present disclosure.
  • Such a system may be taken to include one or more processors and one or more computer-readable storage mediums.
  • the system 200 described hereinbefore may include a processor (or controller) 204 and a computer- readable storage medium (or memory) 202 which are for example used in various processing carried out therein as described herein.
  • a memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
  • a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof.
  • a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g., a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor).
  • A “circuit” may also be a processor executing software, e.g., any kind of computer program, e.g., a computer program using a virtual machine code, e.g., Java.
  • a “module” may be a portion of a system according to various embodiments in the present invention and may encompass a “circuit” as above, or may be understood to be any kind of a logic-implementing entity therefrom.
  • the present specification also discloses a system (e.g., which may also be embodied as a device or an apparatus), such as the system 200, for performing the operations/functions of the method(s) described herein.
  • Such a system may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer.
  • the algorithms presented herein are not inherently related to any particular computer or other apparatus.
  • Various general-purpose machines may be used with computer programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate.
  • the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that the individual steps of the methods described herein may be put into effect by computer code.
  • the computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein.
  • the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention.
  • modules described herein may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.
  • a computer program/module or method described herein may be performed in parallel rather than sequentially.
  • Such a computer program may be stored on any computer readable medium.
  • the computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer.
  • the computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the methods described herein.
  • a computer program product embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions (e.g., the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210) executable by one or more computer processors to perform a method 100 of distributed data management as described hereinbefore with reference to FIG.1.
  • various computer programs or modules described herein may be stored in a computer program product receivable by a system therein, such as the system 200 as shown in FIG.2, for execution by at least one processor 204 of the system 200 to perform the required or desired functions.
  • the software or functional modules described herein may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the software or functional module(s) described herein can also be implemented as a combination of hardware and software modules.
  • the system 200 may be realized by any computer system (e.g., desktop or portable computer system) including at least one processor and a memory, such as a computer system 300 as schematically shown in FIG. 3 as an example only and without limitation.
  • Various methods/steps or functional modules (e.g., the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210) may be implemented as software, such as a computer program being executed within the computer system 300, and instructing the computer system 300 (in particular, one or more processors therein) to conduct the methods/functions of various embodiments described herein.
  • the computer system 300 may comprise a computer module 302, input modules, such as a keyboard 304 and a mouse 306, and a plurality of output devices such as a display 308, and a printer 310.
  • the computer module 302 may be connected to a computer network 312 via a suitable transceiver device 314, to enable access to e.g., the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN).
  • the computer module 302 in the example may include a processor 318 for executing various instructions, a Random Access Memory (RAM) 320 and a Read Only Memory (ROM) 322.
  • the computer module 302 may also include a number of Input/Output (I/O) interfaces, for example I/O interface 324 to the display 308, and I/O interface 326 to the keyboard 304.
  • the components of the computer module 302 typically communicate via an interconnected bus 328 and in a manner known to the person skilled in the relevant art.
  • FIG. 4 depicts a schematic block diagram of a network system 400 for distributed data management according to various embodiments of the present invention.
  • the network system 400 comprises a plurality of distributed systems (e.g., each may also be embodied as a device or an apparatus, such as a cloud server) 401 comprising a plurality of distributed computing nodes, respectively, each distributed system 401 comprising: a memory 402; and at least one processor 404 communicatively coupled to the memory 402 and comprising the corresponding distributed computing node of the plurality of distributed computing nodes; and a system 200 for distributed data management as described hereinbefore according to various embodiments, such as with reference to FIG. 2.
  • the system 200 comprises a memory 202, and at least one processor 204 communicatively coupled to the memory 202 and configured to: decompose, at a source computing node of the at least one processor 204, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes (of the plurality of distributed systems 401, respectively), respectively; and store, at the memory 202 associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block.
  • the plurality of distributed computing nodes may thus receive the plurality of randomized tensor blocks, respectively, from the source computing node (of the system 200).
  • any reference to an element or a feature herein using a designation such as “first,”“second,” and so forth does not limit the quantity or order of such elements or features.
  • such designations are used herein as a convenient method of distinguishing between two or more elements or instances of an element.
  • a reference to first and second elements does not mean that only two elements can be employed, or that the first element must precede the second element.
  • a phrase referring to“at least one of” a list of items refers to any single item therein or any combination of two or more items therein.
  • Encryption is a proven technique to protect confidential data by encoding the information in such a way that only authorized parties with the decryption key can decipher it. Encryption may be susceptible to side-channel attacks and many existing public key encryption schemes may be susceptible to quantum-computing attackers in future. Big data encryption is complicated in terms of the key management and distribution, and the re-encryption of large amount of data is a bottleneck to scalability of encryption technique. Therefore, commercial applications may encrypt only confidential data at much smaller scale on clouds, such as relational databases which contain customer and proprietary information. Furthermore, encrypted data computation such as homomorphic encryption may involve very high computational complexity which can incur up to several orders of performance overhead, hence may make it impractical for big data processing.
  • in Secure Multi-Party Computation (SMPC), the security model may require the servers to be controlled by mutually-untrusted parties to achieve information-theoretical security.
  • Existing schemes such as garbled circuit may require oblivious transfer and symmetric cryptographic operations in the online phase, and secret sharing scheme may require high round/communication complexity and may only be practical/efficient enough with high-speed networking such as LAN connections.
  • These existing techniques are typically based on modular arithmetic and work on discrete values such as integers and fixed-point representations. Protocols and algorithms have to be redeveloped for continuous values such as floating-point arithmetic, which may make such classical techniques inefficient and not scalable for modern big data and machine learning applications.
  • A hardware enclave or secure enclave is a hardware-enforced isolated execution environment (or a processor with security support) that allows general-purpose computing on confidential or sensitive data; that is, encrypted data can be decrypted inside the hardware enclave and computation can be done on the plaintext.
  • INTEL’s SGX and ARM’s TrustZone are paving the way towards realizing hardware enclaves for various applications. Nonetheless, hardware enclaves are still costly and experimental at the time being, and most processors, such as cloud resources, commercial workstations and AI chips, do not currently have such a high level of security support.
  • hardware enclaves may become the single point of attack failure, and various example embodiments of the present invention note that a new paradigm of distributed trust is needed to ensure the privacy of big data distributed applications, such as secure multi-party computation.
  • Data anonymization techniques such as data perturbation and differential privacy inject noise into the data to prevent privacy leakage.
  • it may be difficult to control the noise threshold in order to balance usability and privacy of confidential or sensitive data.
  • These transformation techniques are not reversible and therefore result in information loss. Consequently, data anonymization may not provide a very effective solution for big data storage, communication, sharing, and computation on public cloud resources.
  • Data splitting techniques may partition data into blocks at the byte, attribute, or semantic level. Data splitting may be used in the industry to provide layered data protection. However, data splitting releases true values and may require a centralized server to keep track of the splitting criterion to ensure that each block does not leak privacy.
  • Tensor network computing is a well-established technique among the numerical community. The technique provides unprecedented large-scale scientific computing with performance comparable to competing techniques, such as sparse-grid methods.
  • A tensor network represents functions as sparsely-interconnected low-order core tensors and factor matrices, and represents the operators by distributed tensor network operations.
  • Tensor networks were first discovered in quantum physics in the 1990s, when physicists made the first attempt to capture and model the multi-scale interactions among entangled quantum particles in a parsimonious manner and to simulate how they evolve over time using a set of dynamical equations.
  • Tensor networks were then independently rediscovered in the 2000s by the numerical community and have found wide applications ranging from scientific computing to electronic design automation.
  • Tensor decomposition, as a multidimensional generalization of matrix decomposition, is a decades-old mathematical technique in multiway analysis dating from the 1960s.
  • Tensor techniques have been applied from signal processing, such as blind source separation and multimodal data fusion, to machine learning, such as model compression and learning latent variable models.
  • Tensor network computing has been applied in big data processing due to its ability to model a wide variety of data, such as graphical, tabular, discrete, and continuous data; to provide algorithms that cater for different data quality/veracity or missing data; to provide real-time analytics for high data velocity, such as streaming analytics; and to capture the complex correlation structure in data with large volume and generate valuable insights for many big data distributed applications.
  • Big data generated from sensor networks or Internet-of-Things may facilitate machine learning, in particular deep learning, in order to train cutting-edge intelligent systems for real-time decision making and precision analytics.
  • big data may contain proprietary information or personal information such as location, health, emotion, and preference information of individuals which requires proper encryption and access control to protect users’ privacy.
  • Symmetric key encryption works by adding entropy/disorderliness into data using encryption algorithms and pseudo-random number generator so that unauthorized users cannot find pattern from the ciphertext and decipher them, however, higher computational cost is usually incurred with added functionality such as ordered operations (addition/multiplication) in homomorphic encryption and asymmetric keys in public key encryption.
  • Encryption suffers from complicated key management and distribution especially when organizations or enterprises nowadays are undergoing digital transformation to complex computing environments, such as multi-/hybrid-cloud and mobile environments.
  • the field of SMPC originates from Yao's garbled circuit in 1982, in which untrusted parties jointly compute a function without disclosing their private inputs.
  • SMPC has evolved and adopts distributed trust paradigm in recent years given the complex computing environments, increasing attack surfaces, and recurring security breaches.
  • the secret shares are distributed among multiple computing nodes in order to be information-theoretically secure, that is, secure against adversary with unbounded computational resources.
  • SMPC computing primitives include secret sharing, garbled circuit, and homomorphic encryption.
  • the supported secure operations are arithmetic, boolean, comparison, and bitwise operations, and other secure building blocks that are routinely used in SMPC are oblivious transfer, commitment schemes, and zero-knowledge proofs. It is known that fully homomorphic encryption suffers from very high computational complexity, making it impractical to compute complex functions during operational deployment; secret sharing and garbled circuits need many rounds of communication and therefore require low-latency networks to operate efficiently. Furthermore, garbled circuits involve symmetric encryption during the online phase. The communication complexity of existing practical SMPC protocols may incur a runtime delay ranging from an order of magnitude in a LAN setting to several orders of magnitude in a WAN setting, compared to plaintext processing.
  • Various example embodiments note that the quest for scalability calls for innovative data-security solutions which not only simplify privacy management and secure operations but also provide seamless integration between privacy-preserving big data storage/communication and computation/sharing.
  • various example embodiments note that a fundamental change in the secure computation paradigm from the classical encryption/SMPC techniques is required: from scrambling at the data-entry level to scrambling at the data-chunk/data-block level based on distributed tensor network representations and distributed computation.
  • unlike classical encryption and SMPC techniques, which are based on modular arithmetic and work on fixed-point representations, tensor networks naturally support both floating-point and fixed-point arithmetic/operations.
  • various example embodiments provide a randomized tensor network decomposition technique (e.g., corresponding to the method of distributed data management as described hereinbefore according to various embodiments) to efficiently decompose big data into fragments (smaller blocks of tensor) with partial information (e.g., corresponding to the plurality of randomized tensor blocks as described hereinbefore according to various embodiments) that are randomized, un-linkable, and not interpretable.
  • the above-mentioned partial information may refer to mathematical, structural and/or intrinsic information in each of the randomized tensor blocks after tensor network decomposition being partial, and only with all of the randomized tensor blocks that correspond to a particular data (i.e., the original data), a reconstructed data may be produced corresponding to that particular data.
  • various example embodiments provide a randomized information dispersal method or algorithm.
  • these fragments may be distributed among multiple computing nodes (e.g., virtual instances, devices, servers, clouds and so on) controlled by non-colluding parties, or by one party with multiple authentication factors, to provide distributed trust.
  • the fragments may be protected by metadata privacy such that only an authorized user is able to recover the original data (or original record) with the metadata that stores the fragments’ locations and reconstruction algorithm(s).
  • the distributed tensor network representations naturally support compressed and distributed/dispersed computation, making it well-suited for big data processing.
  • various example embodiments provide randomized and distributed/dispersed tensor network computation such that the fragments (or tensor blocks) are randomized before and after performing mathematical operation.
  • Various example embodiments note that higher-order tensor decomposition is non-unique in general, and for example, the tensor blocks may be randomized during generation and after performing mathematical operations. For example, addition and multiplication operations may be performed on two pieces of data using the corresponding randomized tensor blocks with distributed tensor operations. The tensor blocks corresponding to each piece of data may first be randomized during the decomposition process.
  • the resultant tensor blocks may be compressed and randomized again using rounding algorithms. As a result, sophisticated hackers would at least need to gain access to all or most of the communication routes, storage or computing nodes/servers in order to recover the original and processed information.
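  • The following sketch illustrates the non-uniqueness that makes such re-randomization possible: inserting a random orthogonal matrix and its transpose between two adjacent tensor-train cores changes both blocks but leaves the represented data unchanged. This is an illustrative construction only, not the randomized TT-rounding algorithm (Algorithm 2) of FIG. 15.

```python
import numpy as np

def rerandomize_pair(core_left, core_right, rng=None):
    """Re-randomize two adjacent TT cores without changing the represented data:
    insert a random orthogonal matrix Q and its transpose on the shared rank index."""
    rng = np.random.default_rng() if rng is None else rng
    r = core_left.shape[-1]
    Q, _ = np.linalg.qr(rng.standard_normal((r, r)))            # random orthogonal matrix
    new_left = np.tensordot(core_left, Q, axes=([-1], [0]))     # G_left @ Q
    new_right = np.tensordot(Q.T, core_right, axes=([-1], [0])) # Q^T @ G_right
    return new_left, new_right

g1, g2 = np.random.rand(1, 4, 3), np.random.rand(3, 5, 1)
h1, h2 = rerandomize_pair(g1, g2)
before = np.tensordot(g1, g2, axes=([-1], [0]))
after = np.tensordot(h1, h2, axes=([-1], [0]))
print(np.allclose(before, after), np.allclose(g1, h1))          # True False
```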
  • Various example embodiments also provide an incremental update scheme or technique of the randomized tensor network representations to cater for real-time streaming data. Furthermore, various example embodiments also provide conversion to-and-fro and operations between the tensor network representations and classical secret-sharing scheme to increase the range of supported secure operations.
  • the randomized tensor network decomposition algorithms or methods according to various example embodiments is able to decompose and process various kind of data structures, such as tabular, graphical, discrete, or continuous data (e.g., relational databases, graphical databases, structured, unstructured, and semi-structured databases), and pre-process big data for data integration and cleaning, whereby the tensor representations can be updated incrementally or dynamically.
  • the randomized tensor network decomposition method according to various example embodiments can be easily integrated into existing computing platforms, environments, and processes (e.g., mobile-cloud environments).
  • An example implementation framework according to various example embodiments of the present invention will now be described.
  • Various example embodiments make use of the distributed storage and processing of tensor network representations to seamlessly provide privacy-preserving big data storage, communication, computation, and sharing. Privacy and security of distributed/dispersed tensor network representations and computation can be enhanced significantly within the multi-party computation setting by distributing the tensor blocks (or distributed tensor network representations) to different distributed computing nodes (e.g., virtual instances or servers (e.g., cloud servers)) with metadata privacy.
  • access control of the fragments (tensor blocks) of the tensor network representations is given to non-colluding parties or one party with different authentication factors on different portions of the fragments for providing distributed trust for data protection.
  • Various example embodiments may be combined with traditional data-security technologies for data-secure implementations, such as data anonymization, differential privacy, data splitting, and encryption to provide layered protection, perform the conversion to-and-fro and/or operations between distributed tensor network and classical secure multi-party computation or secret-sharing scheme or perform computation with the aid of secure-enclave technology to increase the flexibility of secure computing circuits or functionality and computational or communication efficiency, implement digital signatures and hashing or blockchain technology to authenticate the randomized tensor blocks and ensure data availability and integrity, combine with MAC (message authentication code) and digital signatures to provide verifiable computation that is secure against malicious parties, and to compute zero-knowledge proof.
  • machine-learning models are compressed and trained in tensor network representations with differential privacy
• Metadata of the data fragments contains the fragments’ locations that correspond to a particular record and the reconstruction algorithms to recover the records.
• a metadata stored at a source computing node may include: IP addresses of the recipient nodes (i.e., the distributed computing nodes to which the randomized tensor blocks have been transmitted, respectively); the filename of each randomized tensor block; the storage format; the tensor network structure; the filename of the original data; the location of each tensor block (which is assigned randomly by the source computing node); and the identity information of each tensor block (which is assigned or generated in an anonymized manner by the source computing node), such that it does not reveal the filename of the original data.
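As a concrete illustration of the kind of metadata record kept at the source computing node, the following is a minimal Python sketch; the field names (e.g., anonymized_id, recipient_ip) and the dataclass layout are illustrative assumptions rather than a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TensorBlockEntry:
    # Per-block entry kept only at the source computing node.
    anonymized_id: str        # identity assigned in an anonymized manner
    recipient_ip: str         # IP address of the recipient distributed computing node
    block_filename: str       # filename of the randomized tensor block
    storage_format: str       # e.g., "tt-core, float32, quantized-8bit"
    position: int             # randomly assigned location/order of the block

@dataclass
class ShredMetadata:
    original_filename: str              # never shared with the recipient nodes
    tensor_network_structure: str       # e.g., "tensor-train, ranks=(1, 8, 8, 1)"
    reconstruction_algorithm: str       # e.g., "tt-reconstruction"
    blocks: List[TensorBlockEntry] = field(default_factory=list)
```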
• Metadata privacy and security may be achieved based on classical data-security technologies, such as encryption or secret-sharing schemes. Both constrained and unconstrained optimization techniques may be used to speed up and compress the distributed tensor network decomposition/computation for various big data applications.
  • the tensor network decomposition may be performed in plaintext, encrypted data computation, or with data perturbation techniques.
  • randomization may be achieved with randomization of the hyperparameters during tensor network decomposition, initialization, and computation.
  • Hyperparameters may refer to parameters that are user-defined and independent of the data. For example, the seed of initialization of the tensor blocks, the updating/learning rate of each entry in the tensor blocks, constraints such as smoothness and sparseness parameters, and data sampling process may be randomized during tensor network decomposition.
  • decomposing the original data randomly into the plurality of randomized tensor blocks based on tensor network decomposition may refer to any randomization (or any randomization technique) in relation to the tensor network decomposition such that the plurality of randomized tensor blocks are obtained.
  • randomization in relation to the tensor network decomposition may include randomized initialization, randomized hyperparameters selection, randomized sampling, randomized updating rate (or learning rate) of each entry in the tensor blocks, randomized mapping algorithms and so on.
• such randomization applies to tensor network decompositions such as Canonical Polyadic (CP), Tucker, and Tensor Train.
• Randomized mapping or projection may utilize a projection matrix, such as Gaussian, Rademacher, or random orthonormal matrices, to project the data tensor to a smaller size in relation to (e.g., before executing) the tensor network decomposition.
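The following is a minimal numpy sketch of such a randomized Gaussian projection applied to a (low-rank) mode-1 unfolding of a data tensor; the matrix sizes, the sketch dimension k, and the QR-based range finder are illustrative assumptions rather than the exact procedure of the embodiments.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy low-rank data matrix standing in for a mode-1 unfolding (I x JK) of a data tensor.
L = rng.standard_normal((60, 10))
R = rng.standard_normal((10, 1200))
A1 = L @ R                                      # rank-10 unfolding of size 60 x 1200

# Gaussian projection matrix compresses the column space to k dimensions.
k = 20
omega = rng.standard_normal((A1.shape[1], k)) / np.sqrt(k)
sketch = A1 @ omega                             # smaller 60 x k sketch used in place of A1

# An orthonormal basis of the sketch can then drive the decomposition step.
Q, _ = np.linalg.qr(sketch)
A1_approx = Q @ (Q.T @ A1)                      # projection of the data onto the sketched range
print("relative error:", np.linalg.norm(A1 - A1_approx) / np.linalg.norm(A1))
```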
  • Randomized sampling techniques such as fiber subset selection or tensor cross approximation may choose a small subset of tensor fibers that approximates the entire data tensor well for tensor network decomposition.
• Existing randomized mapping/projection and sampling algorithms may be utilized for big data reduction so that the data tensor fits into memory for tensor network decomposition; the randomized tensor blocks may therefore be compressed with lossy reconstruction accuracy.
  • Various example embodiments adapt these randomization techniques or algorithms for near-lossless reconstruction, e.g., using residual coding.
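For instance, near-lossless accuracy can be layered on top of a lossy approximation by coding the residual. The sketch below uses a truncated SVD as a stand-in for the lossy tensor network stage; the rank, quantization step, and clipping range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))

# Lossy stage: rank-8 truncated SVD approximation of the data.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 8
A_lossy = (U[:, :r] * s[:r]) @ Vt[:r]

# Residual coding stage: quantize the residual coarsely and store it alongside the lossy part.
step = 0.1
codes = np.clip(np.round((A - A_lossy) / step), -127, 127).astype(np.int8)

# Near-lossless reconstruction = lossy approximation + dequantized residual.
A_hat = A_lossy + codes.astype(np.float32) * step
print("max abs error:", np.abs(A - A_hat).max())   # about step / 2 when no clipping occurs
```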
• Optimization algorithms for tensor network decomposition, such as stochastic gradient descent and evolutionary computation, are able to approximate big data to arbitrary reconstruction accuracy but may suffer from slow convergence.
  • the randomization can come from randomized initialization, randomized updating rate or learning rate of each entry in the tensor blocks.
  • Various example embodiments provide a randomization method or algorithm (or randomized tensor network decomposition) based on a perturbation technique embedded within the tensor network decomposition.
• the randomization method has been found to compress data efficiently with lossy or near-lossless accuracy.
  • various example embodiments randomly distribute the structural information among the tensor blocks during tensor network decomposition using perturbation.
  • the Higher-Order Orthogonal Iteration (HOOI) for Tucker decomposition can be modified to include a perturbation vector to randomly distribute the singular values of SVD amongst the tensor blocks (e.g., core tensors and factor matrices).
• the above-mentioned perturbation technique can also be applied or extended to other tensor network structures, such as but not limited to, extended tensor train, quantized tensor train-Tucker, or tensor ring decomposition, which is closely related to the tensor train decomposition described herein according to various example embodiments. Accordingly, the resultant tensor blocks may be randomized due to the sensitivity of SVD subject to perturbation.
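To make the perturbation idea concrete, the following numpy sketch splits the singular values of a single SVD step randomly between the left and right factors; it is a simplified illustration of the principle, not the exact HOOI or TT-SVD modification of the embodiments, and the perturbation range is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 80))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Random split exponents in (0, 1): each singular value is divided
# between the left and right factors instead of kept on one side.
d = rng.uniform(0.2, 0.8, size=s.shape)
left = U * (s ** d)                     # columns of U scaled by s_i ** d_i
right = (s ** (1.0 - d))[:, None] * Vt  # rows of Vt scaled by s_i ** (1 - d_i)

# The product is unchanged, but neither factor alone reveals U, s, or Vt.
print(np.allclose(left @ right, A))     # True
```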
  • an example randomized tensor network decomposition as will be described later below with reference to FIG.13 may be adapted for missing data imputation by replacing the SVD with CUR decomposition in the randomized tensor network decomposition, which utilizes fiber subset selection for matrix decomposition.
• other low-rank matrix factorization algorithms may be employed in the tensor network decomposition (e.g., tensor-train decomposition), such as randomized SVD based on the randomized projection method, Robust Principal Component Analysis, Non-Negative Matrix Factorization, Sparse Component Analysis, Independent Component Analysis, and so on.
  • SVD is employed as it is found to be much more efficient according to various example embodiments of the present invention.
  • Various example embodiments may be implemented at filesystem, database, or application levels of existing software architecture. For example, various example embodiments may operate on the byte, attribute, or semantic level of data.
  • the data may be compressed or approximated using tensor network with lossy or lossless accuracy. Near-lossless data accuracy may be achieved with tensor network lossy compression and residual coding. Further compression of tensor network may be performed with existing codec, such as dictionary-based compression, run-length encoding, and arithmetic coding.
  • tensor networks decompose big data (or big and complex data) into fragments (a plurality of randomized tensor blocks) with partial information which are not recognizable and not interpretable, and the fragments may then be communicated and stored on single device/cloud (e.g., multiple virtual instances in the single device) or multiple devices/clouds with metadata privacy.
  • various example embodiments advantageously simplify the encryption key management and distribution by providing distributed trust for big data applications in complex mobile-cloud environments nowadays.
• Enterprises may share data by transferring the metadata and data fragments through secure communication channels without exchanging decryption keys or performing re-encryption.
• the distributed data management method according to various example embodiments may advantageously be applied to big data decomposition and compression and inherently supports compressed and distributed computation. Therefore, big data processing is advantageously simplified by removing the need to decompress or preprocess the data compared to classical privacy-preserving computation techniques. Furthermore, the distributed nature of the distributed data management method according to various example embodiments naturally supports secure multi-party computation based on fixed-point or floating-point arithmetic and reduces the communication overhead between multiple computing nodes. For example, the distributed data management method according to various example embodiments provides layered protection against sophisticated hackers, who would have to search a sea of big data for the fragments that correspond to a particular record. The distributed data management method may be combined with existing privacy-preserving techniques and seamlessly integrated into existing computing environments, platforms, or processes.
  • FIGs.5A and 5B depict tables and diagrams systematically comparing the distributed data management method (e.g., the tensor network decomposition method, which may also be referred to as the big data or tensor shredding method) according to various example embodiments of the present invention against existing data-security solutions based on technical parameters, and more particularly, comparative analysis in relation to private storage/sharing. Note that these techniques are not mutually exclusive with each other, but may be combined to provide layered protection.
  • Tensor decomposition is a well-established technique in the field of signal processing and machine learning.
• Tensor decomposition or tensor networks decompose higher-order tensors into sparsely-interconnected small-scale factor matrices and/or low-order core tensors. These low-order core tensors may be referred to as “components”, “blocks”, “factors”, or “cores”, which encode the intrinsic or latent information of the original data.
  • FIGs. 6A, 6B and 7 illustrate a few examples of basic tensor network models.
  • the nodes shown in FIG. 7 may correspond to smaller tensor blocks
  • the edges may correspond to the dimensionality of each block
  • the connections may represent particular mathematical operations needed for data reconstruction
  • the number of free edges may correspond to the number of dimensions in the original tensor/data block.
  • Tensor network decomposition comprises matrix decomposition techniques based on basic tensor network formats such as shown in FIGs.6A, 6B and 7.
  • FIGs.6A and 6B illustrate different tensor networks representations, including Canonical Polyadic (CP), Tucker Decomposition (TD), Hierarchical Tucker (HT), and Tensor Train (TT).
• {I, J, K, L} and R_k refer to the data modes and ranks, respectively.
  • FIG. 7 illustrates tensor networks with different network topology using graphical representations, including CP, TD, HT, and TT.
• I_k and R_k refer to the data modes and ranks, respectively.
• Other tensor network formats include formats that combine two or more basic formats into one tensor network representation, generalized tensor network decompositions that involve other mathematical operations between the tensor blocks, and more sophisticated or high-dimensional formats, such as Matrix Product States (MPS), Projected Entangled Pair State (PEPS), and Multi-Scale Entanglement Renormalization Ansatz (MERA).
  • tensor network representations are commonly used to model the large number of parameters and their complex variations described by the dynamic equations, and to compute using distributed tensor network operations how the interactions between the subsystems evolve over time.
• the ability to do compressed distributed computation using tensor networks has dramatically improved the computational efficiency and memory requirements in scientific computing as well as machine learning models without compromising the simulation or model performance.
  • FIG.8 shows an example of distributed/dispersed computation that can be performed with distributed tensor network operations by way of an example only and without limitations.
• FIG. 8 illustrates that multilinear operations, such as addition, multiplication, matrix-by-vector/matrix multiplication, and so on (e.g., various multilinear operations disclosed in “Lee, Namgil, and Andrzej Cichocki, Fundamental tensor operations for large-scale data analysis using tensor network formats, Multidimensional Systems and Signal Processing 29.3 (2018): 921-960”), can be performed in a distributed/dispersed manner to enhance the efficiency and privacy, according to various example embodiments of the present invention. For example, information may be dispersed throughout the storage, communication, and computation, hence ensuring the data privacy of the original inputs without sacrificing efficiency.
  • FIG.9 depicts an example distributed data management method in relation to big data at-rest and in-transit security according to various example embodiments of the present invention.
  • Data owner may decompose (or shred or approximate) their data using tensor network decomposition and distribute the smaller tensor blocks to multiple clouds, hybrid clouds, multiple virtual instances of a single cloud, servers, or devices.
  • the decomposition (or shredding) process may be performed in plaintext, anonymized/perturbed data, or encrypted data.
  • the tensor blocks may be further compressed using existing codec before or after distribution to multiple storage points. According to various example embodiments, there are a number of ways to realize the secure multi-party computation setting to provide distributed trust depending on the security model.
• the data owner may store the tensor blocks in multi-cloud or hybrid-cloud environments.
• the data owner may store the tensor blocks on multiple virtual instances with different authentication factors; however, this may potentially leak data privacy to the cloud administrator but still resists hacking by removing a single point of failure.
• an organization may give different sub-organizations access to portions of the tensor blocks that correspond to a particular record (or a particular data), and this may be implemented in a single cloud, multiple clouds, or hybrid clouds.
  • a data owner may also distribute the tensor blocks to multiple devices and mobile-cloud environments.
  • metadata of the data fragments corresponding to particular record may be protected by encryption or secret sharing scheme.
  • the metadata stores the fragments’ location and reconstruction algorithm in order to recover the original data.
• Accordingly, one (e.g., a data owner) can recover a particular data (e.g., an original data or information or a particular record); the reconstructed data accuracy can be lossy, lossless, or near-lossless.
• the data owner or organization may first retrieve the metadata from the storage points, locate and download the data fragments/tensor blocks from the clouds, servers, or devices, and reconstruct the original data using the reconstruction algorithm stored in the metadata.
  • the example distributed data management method may be combined with existing privacy- preserving technologies, such as but not limited to, anonymization, secret sharing, encryption, and secure enclave, to provide layered protection and enhance the functionality or efficiency.
  • FIG.10 depicts an example distributed data management method in relation to secure big data sharing according to various example embodiments of the present invention.
  • Data or content owner may instruct the clouds or devices that store the fragments corresponding to a particular data (e.g., an original data or a particular record) to give access to an intended user to share that particular data.
  • the data or content owner may also share the metadata corresponding to that particular data to the intended user so that the user can reconstruct that particular data using the metadata and data fragments associated with that particular data. All the communication may be performed on multiple secure communication channels to provide distributed trust. Compression and granular data access control on the dataset or database can be simultaneously achieved with the example distributed data management method according to various example embodiments.
  • FIG.11 depicts an example distributed data management method in relation to privacy-preserving big data computation according to various example embodiments of the present invention.
  • Data or content owner may instruct the clouds, servers, or devices that contain the tensor blocks (or data fragments) corresponding to a particular data (e.g., an original data or a particular record) to perform secure multi-party computation based on distributed tensor network representations and operations.
  • the secure computation may be combined with existing privacy-preserving techniques, such as secret sharing, encryption, and secure enclave.
  • Each cloud, server, or device may contain and communicate only partial information (i.e., the respective randomized tensor block stored thereat) and therefore hackers would at least have to gain access to multiple routes, storage, and computing nodes in order to reconstruct that particular data.
  • the data owner may also send incremental updates to the storage/computing nodes to update the tensor blocks using distributed/dispersed tensor network computation to ensure the updated data remains compressed.
  • FIG.12 depicts an example distributed data management method in relation to secure multi-party computation according to various example embodiments of the present invention.
  • Multiple data or content owners may instruct the clouds that contain the data fragments (or tensor blocks) corresponding to respective records to perform secure multi-party computation based on distributed tensor network representations and computation.
  • Each data owner has control on different portions of the shared fragments (i.e., the respective randomized tensor block stored thereat) to ensure privacy and fairness of joint computation.
  • the distributed tensor blocks of the computed function may be retrieved by the data or content owners.
  • the example distributed data management method may be combined with secret sharing, encryption, and/or secure enclave to increase the functionality and improve the computational or communication efficiency.
  • the distributed data management method advantageously provides efficient and secure big data sharing, secure multi-party computation, and scalability for privacy- preserving big data computation.
• the distributed data management method can help to secure big data applications such as data warehouses (e.g., data cleaning and integration), databases/database queries, operations and analytics, and filesystem privacy for distributed software applications on clouds, fogs, edges, and devices, to facilitate digital transformation and digital data sharing within and across enterprises ranging from healthcare and smart manufacturing to smart city applications.
  • the distributed data management method can be used for compressed and private computation of machine learning models and large- scale numerical computing.
  • various example embodiments provide randomized algorithms (which may also be referred to as randomized tensor network decomposition) to randomly decompose big data into randomized tensor network representations (a plurality of randomized tensor blocks) and analyse the privacy leakage of distributed tensor operations.
  • the computational and communication complexity are benchmarked against existing secure computation techniques.
  • the tensor representations may be distributed on multiple clouds/fogs or servers/devices with metadata privacy, so as to provide both distributed trust/management to seamlessly secure big data storage, communication, sharing, and computation.
• Various example embodiments will be described based on tensor train (TT) decomposition using singular value decomposition (SVD) (which may herein be referred to as TT-SVD decomposition) and a secret-sharing scheme or technique, unless stated otherwise.
• Secure Multi-Party Computation (SMPC).
  • Homomorphic encryption schemes may provide the strongest privacy for computation on encrypted data using third-party servers such as clouds, however, they usually incur high computational and storage overhead and require a trusted authority to generate and distribute the public/private keys for all the parties.
• Examples of SMPC protocols include SPDZ and its recent development (e.g., MASCOT).
• All these schemes have to be carefully adapted to machine learning because it involves multi-step sequential computation during the inference stage and iterative computation during the training stage.
  • Arithmetic secret sharing and homomorphic encryption are typically used in executing arithmetic operations such as matrix multiplication/convolution, whereas Garbled circuit is used in executing boolean operations such as rectified linear units, max pooling, and their derivatives. Garbled circuit incurs a multiplicative overhead proportional to the security parameter in communication and requires oblivious transfer protocols, as well as symmetric encryption during the online phase.
  • Share conversion protocols may be used in deep learning to convert from an arithmetic encoding to a Boolean encoding and vice-versa, which is another source of inefficiency.
• For model inference, there is some hope in terms of scalability; this is because model training may usually be done in a federated and differentially-private manner for horizontally-partitioned data, whereas model inference may typically be done in a central manner with no differential privacy on the data input in order to obtain accurate predictions.
  • machine learning models do not require high-precision computation, therefore efficient and secure computation can be achieved with quantization/binarization, sparsification/model pruning, and parallel distributed computation.
  • deep learning models can be pre-processed according to the SMPC protocols to render significantly faster inference.
• Trusted Execution Environment (TEE) or secure enclave is a secure execution environment that protects applications running inside the enclave from malicious code outside, such as a compromised operating system or hypervisor.
  • Secure enclaves have been adopted in diverse environments ranging from cloud servers, client devices, mobile and Internet-of-Things (IoT) devices, to embedded sensors to securely store and process sensitive information, such as cryptographic keys, biometric data like fingerprints and face identity information, and key management.
• a small codebase is typically easier to secure using static and dynamic verification tools, as well as sandboxing; therefore, it may be important to split an AI system’s code into a minimal codebase running inside the enclave and code running outside in untrusted mode by leveraging cryptographic techniques.
• the disk, network, and memory access patterns have to be data-oblivious in order to prevent side-channel attacks that may leak a large amount of sensitive data.
• Several works leverage the secure enclave for machine learning, e.g., Myelin’s training on a multithreaded enclave, Chiron’s distributed training on several enclaves, and Slalom’s GPU delegation of matrix multiplication; CalTrain detects poisoned and mislabeled training data that lead to runtime mispredictions, and new differentially-private and oblivious sampling algorithms have been proposed for trusted processors.
• Differential Privacy and Federated Learning: differential privacy is a mathematical framework to rigorously quantify the amount of information leaked about each item of a training dataset in machine learning.
  • the framework was initially proposed for a (central) statistical database to bound the privacy leakage of individual’s information from one-time or repeated query mechanisms.
  • the framework was then extended to protect the (local) privacy of decentralized or distributed data in federated learning, such that the data owners only share (partial/subset of) model updates instead of their own data to the central server.
  • Differential privacy can be applied at different stages of federated learning, e.g., individual or aggregated updates.
  • recent research questions the effectiveness of differential privacy for well-trained models, especially with prior knowledge or more detailed information of the sensitive data to guide the model inversion.
  • Data Anonymization is perhaps the simplest low-cost solution that may be adopted for secure data sharing within and across enterprises for diverse applications, including machine learning.
  • Data anonymization techniques cover both the removal of personally identifiable information (e.g., using hashing and masking techniques) and data randomization/perturbation techniques (e.g., random noise, permutation, and transformation). The random components or functions have to be carefully designed to preserve important information in the training dataset and ensure model performance. Systematic survey of different privacy metrics have been proposed over the years.
• These privacy metrics are based on information theory, data similarity, indistinguishability measures (such as differential privacy), and others; the choice of a suitable privacy metric for a particular setting depends on the adversarial model, data sources, the information available to compute the metric, and the properties to measure.
  • these models/simulations are application-specific (i.e., depend on the training dataset or physical models) and any analysis on the synthetic data has to be verified over the real dataset for validation.
• Blockchain is a distributed ledger that may record transactions in a cryptographically verifiable or tamper-proof manner. Blockchain may be used to create smart contracts between mutually-untrusted parties to automate the workflow of collaborative model training and to ensure the data integrity and process immutability of SMPC and TEE.
• Data Capsule presents a new paradigm for static enforcement of privacy policies by deriving residual policies based on abstract interpretation; this approach provides automatic compliance checking of data privacy regulations even with heterogeneous data processing infrastructures.
  • Machine learning has also been used to discover and classify sensitive data for enterprises to save manpower in manual checking process while ensuring regulation compliance (e.g., Amazon Macie).
• While privacy-preserving matrix and tensor decomposition techniques have been well studied in the literature, various example embodiments note that distributed tensor network representations and computation have not been proposed for privacy preservation and secure computation.
  • the data can be horizontally or vertically partitioned, or secret-shared among them as part of previous computations.
• the secure computation may be outsourced to a set of untrusted but non-colluding servers S_1, S_2, ..., S_m; the clients simply need to distribute or secret-share their inputs among the servers in the initial setup phase, and the servers then proceed to securely compute and communicate using SMPC protocols.
  • the servers can be run on different software stacks to minimize the chance that they all become vulnerable to the exploit available to malware attacks and can be operated under different sub-organizations to minimize insider threats.
  • the secret shares can be distributed to different cloud accounts provided by the same cloud service provider (CSP) or to different clouds run by different CSPs (e.g., multi-cloud or hybrid-cloud environments).
• Various example embodiments assume a semi-honest adversary A (or so-called honest-but-curious adversary) who can corrupt any subset of the clients and at most m − 1 servers at any point of time.
  • This security definition may require that an adversary learn only the corrupted clients’ inputs but nothing else about the honest clients’ inputs beyond the trained model.
• the example secret-sharing scheme based on tensor networks may not be symmetric with respect to the servers, with each server storing index-specific information.
• An example secret-sharing scheme based on distributed tensor networks will now be described below in further detail according to various example embodiments of the present invention.
  • example secure re-sharing protocols and conversion to-and-fro the classical additive secret-sharing scheme are disclosed to preserve the privacy of the computed outputs, thereby enabling the proposed protocols to be arbitrarily composed for complex models.
  • the tensor network representations satisfy the data complexity (or rank complexity) criteria discussed and theoretically verified later below to be privacy preserving in multi-party setting.
• Tensor networks decompose a data chunk (or data block) at the semantic level.
• the distributed representations include latent information and are distributed among multiple servers with metadata privacy.
  • the distributed and non-colluding computing nodes may only know the anonymized filename and format of their received tensor blocks, but each of the distributed computing nodes does not know the filename and tensor network structure of original data.
• Each server has its own encryption, access control, and security mechanisms, thus providing distributed trust for data privacy protection in complex computing environments.
  • randomization algorithms or methods will be described with reference to FIGs.13 and 15 for decomposing data chunk/block into randomly-distributed tensor representations, and with reference to FIG. 16 for securely updating the tensor representations.
  • the conversion to-and-fro the classical additive secret-sharing scheme will also be described.
  • the success of multiway component analysis may be due to the existence of efficient algorithms for matrix and tensor decomposition and the possibility to extract components with physical meaning by imposing constraints such as sparsity, orthogonality, smoothness, and non-negativity.
• Various example embodiments note that higher-order tensor decomposition is typically non-unique; each core or sub-block contains index-specific information which is un-linkable and non-interpretable.
• Various example embodiments employ tensor-train (TT) decomposition due to its flexibility for distributed multilinear operations and the possibility to convert other specific tensor models (e.g., Canonical Polyadic, Tucker, and Hierarchical-Tucker decomposition) into TT format.
• TR refers to the tensor chain or tensor ring format.
  • TR representations may be more generalized/powerful compared to TT representations with smaller ranks; whereas extended TT decomposes the TT-cores into smaller blocks.
• the TT decomposition may be based on the original TT-SVD algorithm as disclosed in “Ivan V Oseledets. 2011. Tensor-train decomposition. SIAM Journal on Scientific Computing 33, 5 (2011), 2295-2317”, the content of which being hereby incorporated by reference in its entirety for all purposes.
• the TT decomposition may be defined as (Equation 1):

$\mathcal{A}_N(i_1, i_2, \ldots, i_N) = \sum_{r_1, \ldots, r_{N-1}} G_1(i_1, r_1)\, G_2(r_1, i_2, r_2) \cdots G_N(r_{N-1}, i_N)$

where, for example, $\mathcal{A}_3(i_1, i_2, i_3)$ is a third-order tensor with index $i_k$ and size of each dimension $I_k$; the tuple $(r_1, r_2, \ldots, r_N)$ is called the TT-rank; and $G_f$ denotes the private share (randomized tensor block) stored in server $f$.
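A minimal Python sketch of reconstructing a tensor from its TT-cores according to this definition is given below, assuming cores of shape (r_{k-1}, I_k, r_k) with boundary ranks equal to one; the example core sizes are illustrative.

```python
import numpy as np

def tt_reconstruct(cores):
    """Contract TT-cores G_k of shape (r_{k-1}, I_k, r_k) back into the full tensor."""
    full = cores[0]                                       # shape (1, I_1, r_1)
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full.squeeze(axis=0)                           # drop leading boundary rank r_0 = 1
    return full.squeeze(axis=-1)                          # drop trailing boundary rank r_N = 1

# Illustrative random TT-cores for a 4 x 5 x 6 tensor with TT-ranks (1, 3, 2, 1).
rng = np.random.default_rng(1)
cores = [rng.standard_normal(s) for s in [(1, 4, 3), (3, 5, 2), (2, 6, 1)]]
print(tt_reconstruct(cores).shape)   # (4, 5, 6)
```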
• FIG. 13 shows a randomization algorithm (Algorithm 1), a randomized TT-SVD algorithm that decomposes N-dimensional data into randomized secret shares.
• SVD decomposes a matrix A_2(i_1, i_2) into left and right singular vectors; the basis vectors are ranked by the amount of explained variation in A_2(i_1, i_2), or the so-called singular values.
• Algorithm 1 is based on the above-mentioned original TT-SVD algorithm, which performs sequential (truncated) singular value decomposition (SVD) on the full tensor $\mathcal{A}_N(i_1, i_2, \ldots, i_N)$ in order to obtain the TT decomposition.
  • FIG. 14 shows the graphical representation of the example randomized TT- SVD algorithm (Algorithm 1).
• FIG. 14 illustrates the example randomized TT-SVD algorithm (Algorithm 1) for a 3rd-order tensor.
• the nodes represent tensors and the number of free edges represents the tensor order, e.g., a node with two edges represents a matrix.
  • the edges that connect two nodes refer to multilinear operations that connect the two tensors.
• the maximum (randomized) perturbation vector d should be within a certain threshold based on the magnitude of each singular value, and the positive and negative sign differences of the corresponding singular vector.
  • Algorithm 1 uses a uniformly-distributed perturbation vector d.
• the perturbation is embedded in the TT-SVD algorithm to randomize the distribution of singular values between the core tensors (tensor blocks) during decomposition.
• tSVD refers to truncated SVD that truncates the singular values and vectors to reduce storage size given the relative approximation error e; the truncation parameter is set accordingly for each core tensor.
• Various example embodiments may be based on the perturbation effects of SVD for complex data, namely, the closeness of the singular values determines the sensitivity of the corresponding singular vectors subject to perturbation.
• complex data usually results in SVD with closely-separated singular values; therefore, the tensor blocks generated using Algorithm 1 are highly randomized given the large-but-controlled perturbation.
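The following numpy sketch is written in the spirit of the randomized TT-SVD described above (the actual Algorithm 1 is given in FIG. 13 and is not reproduced here); the uniform perturbation range, the simple rank cap, and the way each singular value is split between the current core and the remainder are illustrative assumptions.

```python
import numpy as np

def randomized_tt_svd(A, max_rank=16, perturb=(0.2, 0.8), seed=None):
    """Sequentially apply SVD to unfoldings of A, splitting the singular values
    randomly between the current core and the remainder (perturbation step)."""
    rng = np.random.default_rng(seed)
    dims = A.shape
    cores, r_prev = [], 1
    rem = A.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(rem, full_matrices=False)
        r = min(max_rank, len(s))
        d = rng.uniform(*perturb, size=r)               # random split exponents
        core = (U[:, :r] * (s[:r] ** d)).reshape(r_prev, dims[k], r)
        cores.append(core)
        rem = (s[:r] ** (1.0 - d))[:, None] * Vt[:r]    # carry the rest forward
        r_prev = r
        if k + 1 < len(dims) - 1:
            rem = rem.reshape(r_prev * dims[k + 1], -1)
    cores.append(rem.reshape(r_prev, dims[-1], 1))
    return cores

# For a 3rd-order example, verify exact recovery when no truncation occurs.
A = np.random.default_rng(3).standard_normal((8, 9, 10))
G1, G2, G3 = randomized_tt_svd(A, max_rank=90, seed=11)
A_hat = np.einsum('aib,bjc,ckd->ijk', G1, G2, G3)
print(np.allclose(A_hat, A))   # True
```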
• Given sufficient rank complexity (e.g., > 10), the randomization can give rise to more than 10^10 possible sets of randomized tensor blocks; real-life data usually results in rank complexity of more than about 100.
• without all the randomized tensor blocks, the original data cannot be fully recovered using Equation 1 for tensor-train reconstruction.
  • the share regeneration can be done with the example randomized TT-SVD algorithm all carried out in TT format.
• the example secret-sharing scheme is asymmetric (each party stores index-specific information); therefore, various example embodiments design or configure the re-sharing procedure such that the new shares are exchanged without any party retaining its generated new share, to prevent data reconstruction.
  • the partition of more sophisticated tensor network structure into private and shared cores may be performed with hierarchical clustering based on pairwise network distance and randomized algorithms that minimize privacy leakage, communication, and computational cost.
  • the example randomized algorithm simply splits the structural information (or correlation structure) randomly into different cores.
  • the sensitivity of SVD decomposition subject to small perturbations is well-known for complex correlation structure, i.e., when the singular values are closely separated.
• the example algorithm randomizes the decomposition by large-but-controlled perturbations that do not affect the data reconstruction accuracy.
• An index that has sufficient rank complexity is privacy-preserving.
  • index that has only zeroes in the TT-cores implies that all the values that correspond to this index are zero.
• the magnitude, sign, and exact position of non-zero values are not leaked even with collusion by all-except-one servers.
• Tensor Network Computing naturally supports a number of multilinear operations in floating-point/fixed-point representations with minimal data-preprocessing (e.g., addition, multiplication, matrix-by-matrix/vector multiplication, and inner product), unlike classical SMPC schemes that only support limited secure operations (e.g., addition or multiplication) and have to be pre-processed every time to carry out different operations, which involves many rounds of communication complexity in order to compute complex functions.
• multilinear operations can be performed in a compressed, distributed manner without the need to reconstruct the original tensor; this is a major advantage of tensor computation in overcoming the curse of dimensionality (or intermediate data explosion) for large-scale optimization problems. An example is sketched below.
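As one concrete instance of a distributed multilinear operation, the sketch below adds two tensors directly in TT format by block-wise concatenation of their cores, so the full tensors never need to be reconstructed; core shapes follow the (r_{k-1}, I_k, r_k) convention used earlier and the example sizes are illustrative.

```python
import numpy as np

def tt_add(cores_a, cores_b):
    """Add two tensors given as TT-cores of shape (r_{k-1}, I_k, r_k),
    without reconstructing either tensor (block-diagonal core concatenation)."""
    n = len(cores_a)
    out = []
    for k, (A, B) in enumerate(zip(cores_a, cores_b)):
        ra0, I, ra1 = A.shape
        rb0, _, rb1 = B.shape
        if k == 0:                       # boundary rank r_0 = 1: concatenate along r_1
            C = np.concatenate([A, B], axis=2)
        elif k == n - 1:                 # boundary rank r_N = 1: concatenate along r_{N-1}
            C = np.concatenate([A, B], axis=0)
        else:                            # interior cores: block-diagonal in the rank modes
            C = np.zeros((ra0 + rb0, I, ra1 + rb1))
            C[:ra0, :, :ra1] = A
            C[ra0:, :, ra1:] = B
        out.append(C)
    return out

# Verify against dense addition on small random TT tensors.
rng = np.random.default_rng(5)
a = [rng.standard_normal(s) for s in [(1, 4, 2), (2, 5, 3), (3, 6, 1)]]
b = [rng.standard_normal(s) for s in [(1, 4, 3), (3, 5, 2), (2, 6, 1)]]
rec = lambda c: np.einsum('aib,bjc,ckd->ijk', *c)
print(np.allclose(rec(tt_add(a, b)), rec(a) + rec(b)))   # True
```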
• the TT-rounding (or recompression) procedure can be implemented to reduce the TT-ranks by (1) orthogonalization of each of the tensor blocks sequentially using QR decomposition from the rightmost tensor block to the leftmost tensor block in TT format and (2) compression and randomization of each of the tensor blocks sequentially using SVD from the leftmost tensor block to the rightmost tensor block in TT format, all in a distributed manner.
  • the example randomized TT-SVD algorithm can be easily extended to the second step of TT-rounding procedure.
  • Randomized TT-rounding in Algorithm 2 as shown in FIG.15 may be used to reduce the rank complexity after performing multilinear tensor operations, such as addition, multiplication, matrix multiplication, and so on.
  • Algorithm 2 includes performing QR decomposition to orthogonalize each of the tensor blocks and performing SVD for compression of each of the tensor blocks, whereby the randomization is performed by a perturbation vector that randomly splits the singular values to the core tensors (or tensor blocks), such as similar to Algorithm 1.
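A minimal numpy sketch of such a two-sweep rounding procedure is given below; it follows the standard TT-rounding structure (right-to-left QR, then left-to-right truncated SVD) with a perturbation-style random split of the singular values, and the rank cap, tolerance, and perturbation range are illustrative assumptions rather than the exact Algorithm 2 of FIG. 15.

```python
import numpy as np

def randomized_tt_round(cores, max_rank=16, tol=1e-10, perturb=(0.2, 0.8), seed=None):
    """Two-sweep TT-rounding: (1) right-to-left QR orthogonalization,
    (2) left-to-right truncated SVD, with each singular value split randomly
    between neighbouring cores (perturbation)."""
    rng = np.random.default_rng(seed)
    cores = [c.copy() for c in cores]
    n = len(cores)

    # Sweep 1: orthogonalize from the rightmost core to the leftmost core.
    for k in range(n - 1, 0, -1):
        r0, I, r1 = cores[k].shape
        Q, R = np.linalg.qr(cores[k].reshape(r0, I * r1).T)      # M^T = Q R
        cores[k] = Q.T.reshape(Q.shape[1], I, r1)
        cores[k - 1] = np.tensordot(cores[k - 1], R.T, axes=([2], [0]))

    # Sweep 2: compress (and randomize) from the leftmost to the rightmost core.
    for k in range(n - 1):
        r0, I, r1 = cores[k].shape
        U, s, Vt = np.linalg.svd(cores[k].reshape(r0 * I, r1), full_matrices=False)
        r = min(max_rank, int(np.sum(s > tol * s[0])))
        d = rng.uniform(*perturb, size=r)
        cores[k] = (U[:, :r] * (s[:r] ** d)).reshape(r0, I, r)
        carry = (s[:r] ** (1.0 - d))[:, None] * Vt[:r]
        cores[k + 1] = np.tensordot(carry, cores[k + 1], axes=([1], [0]))

    return cores

# Rounding preserves the represented tensor while reducing ranks where possible.
rng = np.random.default_rng(2)
cores = [rng.standard_normal(s) for s in [(1, 4, 6), (6, 5, 6), (6, 6, 1)]]
rec = lambda c: np.einsum('aib,bjc,ckd->ijk', *c)
print(np.allclose(rec(randomized_tt_round(cores, max_rank=30, seed=9)), rec(cores)))   # True
```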
  • Algorithm 3 as shown in FIG.16 is a randomized TT incremental updating algorithm based on Algorithms 1 and 2.
• Algorithm 3 may first decompose the incoming data tensor into randomized tensor blocks using Algorithm 1, then pad the old and new data tensors in TT format so that their dimensions are consistent, perform addition of the old and new data tensors in a distributed manner in TT format, and subsequently perform Algorithm 2 in a distributed manner to reduce the rank complexity or storage size of the resultant randomized tensor blocks.
• the addition of two data tensors in TT format may be performed in a distributed manner.
• the conversion from TT format to the additive secret-sharing scheme may be performed as follows: (1) all-except-one parties generate randomized TT-cores with rank complexity similar to their own TT-cores and pass the generated TT-cores to the corresponding party; (2) all the TT-cores are updated using the TT-rounding procedure; (3) the all-except-one parties pass their updated TT-cores to the remaining party (which did not generate randomized TT-cores before) so that it can generate its additive secret share; and (4) the other parties reconstruct their additive secret shares using the randomized TT-cores generated before.
• Tensor network representations can operate with another tensor in full or tensor network format; therefore, it is expected that tensor networks can operate with the additive secret-sharing scheme to increase the range of supported secure operations.

Secure Big Data Storage, Communication, and Sharing with Metadata Privacy
• Public key infrastructure (PKI) uses asymmetric or public key encryption to exchange private keys and create digital signatures for sender authentication and message non-repudiation, whereas symmetric key encryption is used for data encryption.
  • Conventional encryption techniques would require the data owners to download, decrypt, and re-encrypt the requested data in case access policies change dynamically/frequently.
• Proxy re-encryption is a mathematical technique to seamlessly secure distributed data storage and sharing using clouds without leaking data privacy to the semi-trusted third parties. Proxy re-encryption is based on public key encryption; the technique offloads the heavy computational cost of big data re-encryption to third-party servers. However, most algorithms used nowadays are highly susceptible to quantum-computing attacks.
• Tensor decomposition is commonly used in data mining to discover patterns or relationships from billion-scale tensors; this involves highly computation-intensive operations and is usually done with a pool of GPUs/CPUs.
• Various example embodiments re-purpose tensor techniques for big data privacy preservation; a large dataset can be sub-divided into smaller data chunks/blocks (about 100K data entries each) for parallel distributed computation to speed up the decomposition.
  • the advantages of distributed tensor network representations for secure data storage include information-theoretical security, compression, fine-grained data access control, updatability, and compressed computability.
  • existing encryption techniques may provide only computational security (i.e., not quantum-safe); additive secret- sharing schemes may incur at least 2-3 times the storage and communication cost; and most compression algorithms may require decompression to have granular access control, updatability, and computability.
• Metadata may include information on the underlying data and is used in communication and database management systems. Metadata may be broadly categorized into operational, technical, and business metadata according to the specific purposes it serves. In the context of data storage, metadata may serve as the logical “map” for users to navigate through the information and data. Metadata may also help auditors to carry out system review and post-breach damage assessment. After decomposing data and distributing each core or sub-block to multiple clouds/fogs/servers/devices using the example randomized tensor algorithms according to various example embodiments, the actual metadata stays with the data owner and is appended with the location and name of each core (randomized tensor block), the tensor structure, and the reconstruction algorithms.
  • the metadata is encrypted and password-protected to provide layered protection.
• the metadata of each core stored on the multi-party may include only the anonymized names, location, data structure and type, and user access rights (without the time of access). Accordingly, the example distributed data management method according to various example embodiments provides both distributed trust and management for big data privacy preservation; that is, the data privacy is jointly protected by the security architecture provided by different computing platforms and environments. Hackers would have to breach the security of all routes to retrieve the original data, whereas the metadata (i.e., locations, anonymized names of each core, and access rights) can be changed anytime by the data owner according to the data sensitivity and usage.
• Secure data sharing can be done seamlessly by the data owners by sharing the actual metadata of each record; the user will then proceed to download the fragments from the clouds/servers and reconstruct the original records efficiently on the user side. This removes the scalability bottleneck of repeated downloading/decryption/re-encryption by the data owner in multi-user or many-user data sharing settings. Bulk re-encryption is not necessary because the metadata is unique up to the individual record level; any potential leakage is traceable and is also limited to the shared metadata. Accordingly, the example distributed data management method according to various example embodiments is advantageously able to reduce the pain point of the key management problem, especially with distributed, serverless computing and containerized applications.
  • FIGs.17 to 25 show randomized tensor network decompositions of various types of data, e.g., image, audio, video, sensors, graph, and textual data, according to various example embodiments of the present invention.
• Empirical results show that the randomized tensor blocks are un-recognizable and successfully anonymize the original data. Histogram analysis also shows that the tensor blocks’ distributions are always Gaussian or Laplacian-distributed, regardless of the original data distribution.
  • the randomized fragments are orthonormalized and may be multiplied by the randomized scaling factors before tensor network reconstruction. According to various example embodiments, it is found that data complexity may be utilized or applied to mask the data using the randomized tensor network algorithm or method according to various example embodiments of the present invention.
• FIG. 26 shows the image distortion as a result of adding noise into a randomly-selected tensor block of different tensor network decompositions, according to various example embodiments of the present invention. If one of the tensor blocks is unknown (denoted as “random”), the image cannot be reconstructed. On the right-hand side, the distorted images as a result of adding noise are shown; the effect is larger if the perturbation is applied on the important/influential portion of the sub-tensor, however this information is usually unknown to the adversary. CP’s distortion is larger because the format is more compact compared to other tensor networks.
  • FIG. 27 shows the normalized mutual information (NMI) between tensor blocks that belong to particular image (top row), two different images (bottom row), and random noise (“rand”), according to various example embodiments of the present invention.
  • the results show that they are indistinguishable from each other.
• NMI is a universal metric in the sense that if any other distance measure judges two random variables to be close, NMI will also judge them to be close.
• the NMI variation is largely attributed to the variation in the tensor blocks’ value distributions; if the variation in a particular tensor block is high (i.e., its entropy is high), its NMI with other tensor blocks is likely to be smaller.
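A minimal sketch of how such an NMI estimate between two tensor blocks can be computed from a joint histogram of their flattened (equal-length) values is shown below; the bin count and the geometric-mean normalization are illustrative choices, not necessarily those used for FIG. 27.

```python
import numpy as np

def nmi(x, y, bins=64):
    """Normalized mutual information between two samples, estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy)

rng = np.random.default_rng(4)
block_a = rng.standard_normal(10_000)      # stand-ins for flattened tensor blocks
block_b = rng.standard_normal(10_000)
print(nmi(block_a, block_b))               # comparable to the NMI against pure noise
```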
  • FIG. 28 depicts a table to benchmark the tensor network decomposition/reconstruction efficiency (in milliseconds) for different datasets.
  • the classification model performance (top-1 accuracy) of compressed tensor blocks is compared against the original dataset without fine-tuning the convolutional neural network.
  • the time needed generally increases with data size.
• CP decomposition takes about 4 times longer than other tensor network decompositions.
  • the decompression time is generally much faster compared to compression time.
• the tensor blocks are quantized to 8-bit depth. Some of the tensor blocks can be uniformly quantized, but some require non-uniform quantization using Lloyd’s algorithm to reduce the image distortion, e.g., TD’s core G. It can be observed that tensor networks generally retain the features for image classification without the need to retrain the model; at least half of the storage size can be saved using tensor networks for data compression.
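The following is a minimal sketch of the uniform 8-bit quantization step for a tensor block; the block size and the (scale, offset) encoding are illustrative assumptions, and non-uniform (Lloyd) quantization is not shown.

```python
import numpy as np

def quantize_uniform_8bit(block):
    """Uniformly quantize a tensor block to 8-bit codes plus (scale, offset) for dequantization."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((block - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(6)
core = rng.standard_normal((8, 32, 8))               # stand-in for a TT-core
codes, scale, lo = quantize_uniform_8bit(core)
core_hat = dequantize(codes, scale, lo)
print("max abs error:", np.abs(core - core_hat).max())   # at most scale / 2
```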
  • FIG.29 shows an example architecture of an example secret sharing scheme or method based on distributed tensor network, according to various example embodiments of the present invention (e.g., corresponding to the method of distributed data management as described hereinbefore according to various embodiments).
• the data may be ingested from multiple databases and pre-processed in the Hadoop parallel processing framework for various big data applications; Hadoop may ensure data availability by duplicating the data into multiple copies stored on different computing nodes.
  • the method of distributed data management according to various example embodiments provides a big data shredder or dispersal service to distribute the randomized tensor blocks to multiple databases on multiple public clouds.
  • the metadata that comprises the identity information and location information may be stored within the enterprise infrastructure.
  • Scalability is an important consideration for both the success of machine learning as well as for big data privacy preservation.
• Various example embodiments provide a distributed data management method (e.g., secret-sharing scheme) based on distributed tensor network representations and distributed computation that is much more efficient in terms of computational and communication complexity compared to existing SMPC schemes for privacy-preserving machine learning; additionally, the distributed data management method can operate with the classical additive secret-sharing scheme to increase the range of secure operations. Cryptanalysis was carried out to verify that the secret-sharing scheme is secure against a semi-honest adversary and that the computation is secure under the universal composability framework.
  • the distributed data management method can be combined with existing data-security solutions, such as data anonymization, encryption, and secure enclaves to provide layered protection.
• the distributed data management method according to various example embodiments may be applied in various applications, such as but not limited to, privacy-preserving big data analytics and large-scale numerical computing, as well as federated machine learning and applying differential privacy to limit the privacy leakage.

Abstract

There is provided a method of distributed data management, including: decomposing, at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmitting, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and storing, at a memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block. There is also provided a corresponding system for distributed data management and a corresponding network system comprising the system for distributed data management and the plurality of distributed computing nodes.

Description

METHOD AND SYSTEM FOR DISTRIBUTED DATA MANAGEMENT

CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of priority of Singapore Patent Application No. 10201906493S, filed on 12 July 2019, the content of which being hereby incorporated by reference in its entirety for all purposes.

TECHNICAL FIELD
[0002] The present invention generally relates to a method of distributed data management and a system thereof, and more particularly, in relation to big data or big and complex data.

BACKGROUND
[0003] Security is an important issue for organizations and enterprises to outsource data storage, sharing, and computation on clouds/fogs. However, data encryption is complicated in terms of the key management and distribution, and existing secure computation techniques are expensive in terms of computational/communication cost, therefore do not scale to big data computation.
[0004] For example, the development of cutting-edge artificial intelligence (AI) systems requires lots of data collected from sensors and Internet-of-Thing (IoT) devices to achieve high performance in many tasks ranging from business decision-making to personalized services for end customers. Big data requires storage and processing power beyond traditional computing resources, therefore enterprises and government bodies undergoing digital transformation often need to extend their computing capability to the cloud and mobile environments. However, big data may contain sensitive information that can be exploited once placed in the public cloud resources. For example, videos taken by a network of surveillance cameras may contain personal information such as individuals’ locations and preferences. As has been reported in the art, the growing attack surfaces from digital transformation call for simpler data-security solutions such as encryption and access control. However, classical cybersecurity solutions such as end-point security, network security, and digital vault are not scalable and not cost-effective to protect data privacy and security.
[0005] A need therefore exists to provide a method of distributed data management and a system thereof, and more particularly, in relation to big data or big and complex data, that seek to overcome, or at least ameliorate, one or more of the deficiencies in conventional method and system for distributed data management, such as but not limited to, improving or preserving data privacy and security in an effective and/or efficient manner. It is against this background that the present invention has been developed.

SUMMARY
[0006] According to a first aspect of the present invention, there is provided a method of distributed data management using at least one processor, the method comprising: decomposing, at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmitting, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and storing, at a memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block.
[0007] According to a second aspect of the present invention, there is provided a system for distributed data management comprising: a memory; and at least one processor communicatively coupled to the memory and configured to: decompose, at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and store, at the memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block.
[0008] According to a third aspect of the present invention, there is provided a network system for distributed data management, the network system comprising: a plurality of distributed servers comprising a plurality of distributed computing nodes, respectively, each distributed server comprising: a memory; and at least one processor communicatively coupled to the memory and comprises the corresponding distributed computing node; and a system for distributed data management according to the second aspect of the present invention.

[0009] According to a fourth aspect of the present invention, there is provided a computer program product, embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of distributed data management according to the first aspect of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Embodiments of the present invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
FIG. 1 depicts a schematic flow diagram of a method of distributed data management using at least one processor according to various embodiments of the present invention;
FIG. 2 depicts a schematic block diagram of a system for distributed data management according to various embodiments of the present invention, such as corresponding to the method of distributed data management as described with reference to FIG.1;
FIG. 3 depicts a schematic block diagram of an exemplary computer system which the system for distributed data management according to various embodiments of the present invention may be embodied as;
FIG. 4 depicts a schematic block diagram of a network system for distributed data management, according to various embodiments of the present invention;
FIGs. 5A and 5B depict tables and diagrams systematically comparing the distributed data management method according to various example embodiments of the present invention against existing data-security solutions based on technical parameters;
FIGs.6A, 6B and 7 depict example conventional tensor network decomposition techniques, along with their basic tensor network formats;
FIG. 8 shows an example of distributed/dispersed computation that can be performed with distributed tensor network operations;
FIG. 9 depicts an example distributed data management method in relation to big data at-rest and in-transit security, according to various example embodiments of the present invention;
FIG. 10 depicts an example distributed data management method in relation to secure big data sharing, according to various example embodiments of the present invention;
FIG.11 depicts an example distributed data management method in relation to privacy-preserving big data computation, according to various example embodiments of the present invention;
FIG.12 depicts an example distributed data management method in relation to secure multi-party computation, according to various example embodiments of the present invention;
FIG. 13 depicts an example randomization algorithm (randomized TT-SVD algorithm or Algorithm 1), according to various example embodiments of the present invention;
FIG. 14 depicts a graphical representation of the example randomized TT-SVD algorithm (Algorithm 1) shown in FIG. 13, according to various example embodiments of the present invention;
FIG.15 depicts an example randomization algorithm (randomized TT-rounding algorithm or Algorithm 2), according to various example embodiments of the present invention;
FIG. 16 depicts an example randomization algorithm (randomized TT incremental update algorithm or Algorithm 3), according to various example embodiments of the present invention;
FIGs. 17 to 25 show example randomized tensor network decompositions of various types of data, e.g., image, audio, video, sensors, graph, and textual data, according to various example embodiments of the present invention;
FIG. 26 shows the image distortion resulting from adding noise into a randomly-selected tensor block of different tensor network decompositions, according to various example embodiments of the present invention;
FIG. 27 shows the normalized mutual information (NMI) between tensor blocks that belong to a particular image (top row), to two different images (bottom row), and to random noise (“rand”), according to various example embodiments of the present invention;
FIG. 28 depicts a table benchmarking the tensor network decomposition/reconstruction efficiency (in milliseconds) for different datasets, the compression ratio achieved using tensor network compression and quantization, and the classification accuracy of a convolutional neural network using the compressed data; and
FIG. 29 shows an example architecture of an example secret sharing method based on distributed tensor network, according to various example embodiments of the present invention.
DETAILED DESCRIPTION
[0011] Various embodiments of the present invention provide a method of distributed data management and a system thereof, and more particularly, in relation to big data, that seek to overcome, or at least ameliorate, one or more of the deficiencies of conventional methods and systems for distributed data management, such as, but not limited to, by improving or preserving data privacy and security in an effective and/or efficient manner.
[0012] FIG.1 depicts a schematic flow diagram of a method 100 of distributed data management using at least one processor according to various embodiments of the present invention. The method 100 comprises: decomposing (at 102), at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmitting (at 104), at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and storing (at 106), at a memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to (or associated with) the randomized tensor block.
[0013] In various embodiments, the above-mentioned data randomly decomposed into the plurality of randomized tensor blocks may be big data or big and complex data. In various embodiments, the above-mentioned data may be a data chunk or a data block, such as a data chunk or a data block of the big data or big and complex data. In various embodiments, the above-mentioned data may refer to a particular data, such as in relation to a particular record, and may have a particular or unique filename. By way of an example only and without limitation, the particular data may include information where privacy is required or desired to be preserved, such as but not limited to, proprietary information, confidential information or personal information. [0014] In various embodiments, the source computing node may refer to the computing node considered as or referred to as a source or an origin in relation to the above-mentioned data which is decomposed randomly and in relation to the above- mentioned tensor network decomposition. For example, the above-mentioned data may be related to or associated with an owner of the above-mentioned data (i.e., the data owner). In various embodiments, the plurality of distributed computing nodes may refer to the computing nodes which the plurality of randomized tensor blocks are transmitted or distributed to. For example, the plurality of distributed computing nodes may be a plurality of virtual instances (e.g., at the same processor or at different processors), at a plurality of devices (e.g., computing devices), at a plurality of servers (e.g., cloud servers), and so on.
[0015] In various embodiments, in relation to 102, decomposing the above- mentioned data randomly into the plurality of randomized tensor blocks based on tensor network decomposition may refer to any randomization (or any randomization technique) in relation to the tensor network decomposition such that the plurality of randomized tensor blocks are obtained. For example, randomization in relation to the tensor network decomposition may include randomized initialization, randomized hyperparameters selection, randomized sampling, randomized updating rate (or learning rate) of each entry in the tensor blocks, randomized mapping algorithms and so on.
[0016] In various embodiments, in relation to 102, one or more (e.g., all) of the plurality of randomized tensor blocks may each include a group or set of randomized tensor blocks (or randomized tensor sub-blocks). In other words, one or multiple randomized tensor blocks may be generated and grouped or considered as a randomized tensor block as a whole for transmission to one of the plurality of distributed computing nodes.
[0017] In various embodiments, in relation to 106, the identity information relating to the corresponding randomized tensor block may be any information (or data) configured for identifying the corresponding randomized tensor block, such as but not limited to, a filename. In various embodiments, the location information relating to the corresponding randomized tensor block may be any information (or data) configured for locating the randomized tensor block, such as but not limited to, an address (e.g., an IP address) of the distributed computing node which the randomized tensor block is assigned to be transmitted to and/or is transmitted to.
[0018] Accordingly, based on the method 100 of distributed data management, the above-mentioned data is decomposed randomly into a plurality of randomized tensor blocks and transmitted to a plurality of distributed computing nodes, while metadata comprising the identity information and location information relating to the plurality of randomized tensor blocks is stored at a memory associated with the source computing node. In this manner, data privacy and security in relation to the above-mentioned data are advantageously preserved or improved in an effective and/or efficient manner. These and other technical advantages will become more evident or apparent to a person skilled in the art as the method 100 of distributed data management is described in more detail according to various embodiments or example embodiments of the present invention.
[0019] In various embodiments, the tensor network decomposition is based on singular value decomposition (SVD), and the above-mentioned data is randomly decomposed into the plurality of randomized tensor blocks based on perturbation, and more particularly, a perturbation vector configured to randomly distribute singular values associated with the SVD in relation to the plurality of randomized tensor blocks.
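As a minimal numerical sketch of this idea (and not of the claimed algorithm itself), a random exponent vector may be used to split each singular value of an SVD between the two resulting factors, so that the factors are randomized while their product is unchanged; the vector p below is one assumed form of such a perturbation vector:

    import numpy as np

    # Sketch only: a perturbation vector p splits each singular value between
    # the two factors, randomizing the blocks without changing their product.
    rng = np.random.default_rng()
    A = rng.standard_normal((6, 8))                  # placeholder data matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    p = rng.uniform(0.0, 1.0, size=s.size)           # assumed form of the perturbation vector
    block_1 = U * (s ** p)                           # first randomized block absorbs s**p
    block_2 = (s ** (1.0 - p))[:, None] * Vt         # second randomized block absorbs s**(1-p)
    assert np.allclose(block_1 @ block_2, A)         # reconstruction is unaffected

Because a fresh p can be drawn on every run, repeated decompositions of the same data yield different blocks, which is the randomization property relied upon above.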
[0020] In various embodiments, the plurality of randomized tensor blocks are each compressed based on one or more coding techniques or schemes. For example, quantization may be a lossy compression process of mapping an input value to a number of representative output values which have a smaller storage footprint (e.g., 1.2345 being mapped to 1.23). Bit-plane coding compresses each bit plane of the tensor blocks from the most significant bits to the least significant bits. Residual coding compresses the residual after lossy compression of the original data to provide lossless or near-lossless reconstruction accuracy. Transform coding, such as the Fourier or Wavelet transform, converts input data from one kind of representation to another, and the transformed values (or coefficients) are then encoded by compression techniques. Other coding techniques may use dictionary-based encoding (e.g., LZW, LZ77), Run Length Encoding (RLE), or Entropy Encoding such as Huffman or Arithmetic coding to compress data in a lossless manner. RLE represents a sequence of symbols as runs, each with two parts: the data value and its count. For dictionary-based encoding, repeated input patterns may be coded with an index, which is useful if the input sequence contains many repeated patterns. Huffman and Arithmetic coding may be useful when the input source has a small alphabet with skewed probabilities, both being variable-length codes such that frequently occurring symbols may be coded with fewer bits than rarely occurring symbols.
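For illustration only, a minimal sketch of two of the coding schemes mentioned above, uniform scalar quantization and run-length encoding, is given below; the function names, the step size, and the example inputs are arbitrary assumptions rather than part of the described embodiments:

    def quantize(values, step=0.01):
        # Uniform scalar quantization: map each value to the nearest multiple of
        # `step` (lossy, smaller storage footprint), e.g. 1.2345 -> 1.23 for
        # step=0.01, up to floating-point rounding.
        return [round(v / step) * step for v in values]

    def rle_encode(symbols):
        # Run-length encoding: represent a sequence as (value, count) pairs.
        runs = []
        for s in symbols:
            if runs and runs[-1][0] == s:
                runs[-1][1] += 1
            else:
                runs.append([s, 1])
        return [(value, count) for value, count in runs]

    print(quantize([1.2345, 0.5012]))   # approximately [1.23, 0.5]
    print(rle_encode("aaabccdd"))       # [('a', 3), ('b', 1), ('c', 2), ('d', 2)]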
[0021] In various embodiments, the above-mentioned data is an n-th order tensor and each of the plurality of randomized tensor blocks has an equal or a lower order than the above-mentioned data.
[0022] In various embodiments, the above-mentioned data is big and complex data.
[0023] In various embodiments, the plurality of distributed computing nodes are non-colluding amongst themselves. For example, cloud servers may be non-colluding and therefore each of them cannot recover the original data (corresponding to the above-mentioned data), even though each of them may have the respective randomized tensor block stored thereon. In various embodiments, non-colluding distributed computing nodes (or non-colluding parties) may be achieved by having each distributed computing node (e.g., cloud instance) owned by a different user within an organization so as to minimize insider attacks. In this regard, for example, all of the users would have to provide access to their respective distributed computing nodes (e.g., cloud instances) in order to recover the original data. In various embodiments, to obtain non-colluding distributed computing nodes, each distributed computing node (e.g., cloud instance) may be run on a different software stack to minimize the chance that they are all vulnerable to an exploit available to malware.
[0024] In various embodiments, the plurality of distributed computing nodes each has data security implemented thereat, and each of the plurality of randomized tensor blocks received at the corresponding distributed computing node is subjected to the data security implemented thereat. For example, each distributed computing node may have implemented thereat its own encryption, access control, and/or security mechanisms. Therefore, according to various embodiments of the present invention, distributed trust for data privacy protection is advantageously provided.
[0025] In various embodiments, the plurality of randomized tensor blocks are randomly assigned amongst the plurality of distributed computing nodes for said transmission thereto, respectively. In other words, each of the plurality of randomized tensor blocks is randomly assigned to one of the plurality of distributed computing nodes (respectively, i.e., without overlap in assignment) for transmission thereto. In various embodiments, in the metadata, for each of the plurality of randomized tensor blocks, the identity information in relation to the randomized tensor blocks is anonymized and the location information in relation to the randomized tensor block corresponds to an address of the distributed computing node which the randomized tensor block is assigned to. In various embodiments, the metadata associated with the plurality of randomized tensor blocks stored at the memory associated with the source computing node are encrypted (e.g., encrypted by the source computing node).
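A minimal sketch of this bookkeeping is given below; the field names, the use of random hex tokens as anonymized identities, and the assumption that there are at least as many distributed nodes as blocks are illustrative choices rather than requirements of the described embodiments:

    import secrets
    import random

    def build_metadata(block_files, node_addresses, original_name):
        # Randomly assign each randomized tensor block to a distinct distributed node
        # (assumes len(node_addresses) >= len(block_files)) and record an anonymized
        # identity plus the node's address; the mapping stays only with the data owner
        # and would itself be encrypted in practice.
        assigned_nodes = random.sample(node_addresses, k=len(block_files))
        records = []
        for local_file, node in zip(block_files, assigned_nodes):
            records.append({
                "block_id": secrets.token_hex(16),  # anonymized identity information
                "node_address": node,               # location information
                "local_file": local_file,           # known only to the source node
            })
        return {"original_name": original_name, "blocks": records}

Because the anonymized identifiers carry no link to the original filename, a node that receives a block learns neither which record it belongs to nor where the sibling blocks reside.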
[0026] In various embodiments, the method 100 further comprises: transmitting, at the source computing node, a storage request message to first one or more of the plurality of distributed computing nodes (e.g., first group or set of one or more of the plurality of distributed computing nodes) based on the identity information and the location information relating to each of corresponding first one or more of the plurality of randomized tensor blocks for instructing the above-mentioned first one or more of the plurality of distributed computing nodes to store the above-mentioned first one or more of the plurality of randomized tensor blocks received at corresponding one or more memories associated with the above-mentioned first one or more of the plurality of distributed computing nodes, respectively. In various embodiments, the first one or more of the plurality of distributed computing nodes are all of the plurality of distributed computing nodes.
[0027] In various embodiments, the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks. The method 100 further comprises: transmitting, at the source computing node, a retrieval request message to each of the plurality of distributed computing nodes based on the identity information and the location information relating to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to the source computing node; receiving, at the source computing node, the plurality of randomized tensor blocks transmitted from the plurality of distributed computing nodes, respectively, in response to the retrieval request message; and generating, at the source computing node, a reconstructed data corresponding to the above-mentioned data (which may be referred to as the original data) based on the plurality of randomized tensor blocks received and the reconstruction information in the metadata associated with the plurality of randomized tensor blocks. In various embodiments, the reconstruction information may include reconstruction algorithm configured to produce the above-mentioned reconstructed data corresponding to the original data based on the plurality of randomized tensor blocks (e.g., received from the plurality of distributed computing nodes). For example, the above-mentioned data (or original data) may be a particular data and the metadata may thus correspond to that particular data. By way of an example only and without limitation, an original image may be decomposed into three randomized tensor blocks, and the metadata associated with the three randomized tensor blocks stored at the source computing block may include the identities (e.g., anonymized filenames) and locations (e.g., IP addresses of the three distributed computing nodes which the three randomized tensor blocks were assigned or transmitted to) relating to the three randomized tensor blocks and the reconstruction algorithm associated with the three randomized tensor blocks. A reconstructed image corresponding to the original image may then be produced or reconstructed after retrieving all of the three randomized tensor blocks from the three distributed computing nodes.
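Purely as an illustration of the reconstruction step, assuming the retrieved blocks are tensor-train (TT) cores of shape (r_prev, n_k, r_next), the owner may contract them back into the full tensor roughly as follows; this is a generic TT contraction sketch, not the specific reconstruction algorithm stored in the metadata:

    import numpy as np

    def tt_reconstruct(cores):
        # Contract a list of TT cores, each of shape (r_prev, n_k, r_next),
        # back into the full tensor; run by the owner after retrieving all blocks.
        result = cores[0]                                      # shape (1, n_1, r_1)
        for core in cores[1:]:
            # merge the trailing rank index of `result` with the leading rank index of `core`
            result = np.tensordot(result, core, axes=([-1], [0]))
        return result.reshape(result.shape[1:-1])              # drop the boundary rank-1 modes

Note that the contraction requires all cores; any strict subset of them leaves one rank index dangling and does not determine the original data.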
[0028] In various embodiments, the method 100 further comprises: transmitting, at the source computing node, a computation request message to second one or more of the plurality of distributed computing nodes (a second group or set of one or more of the plurality of distributed computing nodes) based on the identity information and the location information relating to each of corresponding second one or more of the plurality of randomized tensor blocks for instructing the above-mentioned second one or more of the plurality of distributed computing nodes to perform a computation on the above-mentioned second one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with the above-mentioned second one or more of the plurality of distributed computing nodes to obtain one or more computed outputs, respectively. In various embodiments, the above-mentioned computation may be one or more of a plurality of multilinear operations, such as but not limited to, addition, multiplication, matrix-by-vector/matrix multiplication and so on. In various embodiments, the method 100 may further comprise receiving, at the source computing node, the above-mentioned one or more computed outputs from the above-mentioned second one or more of the plurality of distributed computing nodes, respectively, in response to the computation request message. In various embodiments, the second one or more of the plurality of distributed computing nodes are all of the plurality of distributed computing nodes.
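As one concrete illustration (a standard tensor-train identity rather than a prescribed protocol), element-wise addition of two tensors held as TT cores can be carried out block-by-block: the node holding the k-th cores of both operands can form the k-th core of their sum locally, without seeing any other block:

    import numpy as np

    def tt_add(cores_x, cores_y):
        # Element-wise addition of two tensors in TT format with matching mode sizes
        # (assumes at least two cores): boundary cores are concatenated along the rank
        # index, middle cores are embedded block-diagonally; each output core depends
        # only on the k-th cores of the operands, so the operation is local per node.
        d = len(cores_x)
        out = []
        for k, (a, b) in enumerate(zip(cores_x, cores_y)):
            ra0, n, ra1 = a.shape
            rb0, _, rb1 = b.shape
            if k == 0:
                c = np.concatenate([a, b], axis=2)
            elif k == d - 1:
                c = np.concatenate([a, b], axis=0)
            else:
                c = np.zeros((ra0 + rb0, n, ra1 + rb1))
                c[:ra0, :, :ra1] = a
                c[ra0:, :, ra1:] = b
            out.append(c)
        return out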
[0029] In various embodiments, similarly as mentioned hereinbefore, the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks. The method 100 further comprises: transmitting, at the source computing node, a sharing request message to each of the plurality of distributed computing nodes based on the identity information and the location information in relation to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to a second computing node (e.g., a third party computing node); and transmitting, at the source computing node, the metadata associated with the plurality of randomized tensor blocks to the second computing node.
[0030] In various embodiments, the method 100 further comprises: transmitting, at the source computing node, an update request message to third one or more of the plurality of distributed computing nodes (a third group or set of one or more of the plurality of distributed computing nodes) based on the identity information and the location information relating to each of corresponding third one or more of the plurality of randomized tensor blocks for instructing the above-mentioned third one or more of the plurality of distributed computing nodes to perform an update on the above- mentioned third one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with the above-mentioned third one or more of the plurality of distributed computing nodes to obtain a plurality of updated randomized tensor blocks, respectively. In various embodiments, the third one or more of the plurality of distributed computing nodes are all of the plurality of distributed computing nodes.
[0031] It will be appreciated by a person skilled in the art that the method 100 is not limited to the order of the steps as shown in FIG.1, and the steps may be performed in any order suitable or appropriate for the same or similar outcome.
[0032] FIG. 2 depicts a schematic block diagram of a system 200 for distributed data management according to various embodiments of the present invention, such as corresponding to the method 100 of distributed data management as described hereinbefore according to various embodiments of the present invention. The system 200 comprises a memory 202, and at least one processor 204 communicatively coupled to the memory 202 and configured to: decompose, at a source computing node of the at least one processor 204, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and store, at the memory 202 associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information in relation to the randomized tensor block.
[0033] It will be appreciated by a person skilled in the art that the at least one processor 204 may be configured to perform the required functions or operations through set(s) of instructions (e.g., software modules) executable by the at least one processor 204 to perform the required functions or operations. Accordingly, as shown in FIG.2, the system 200 may comprise a tensor network decomposition module (or a tensor network decomposition circuit) 206 configured to perform the above-mentioned decompose, at a source computing node of the at least one processor 204, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; a tensor block transmission module (or a tensor block transmission circuit) 208 configured to perform the above-mentioned transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and a metadata module (or a metadata circuit) 210 configured to perform the above-mentioned store, at the memory 202 associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the corresponding randomized tensor block.
[0034] It will be appreciated by a person skilled in the art that the above-mentioned modules are not necessarily separate modules, and two or more modules may be realized by or implemented as one functional module (e.g., a circuit or a software program) as desired or as appropriate without deviating from the scope of the present invention. For example, the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210 may be realized (e.g., compiled together) as one executable software program (e.g., software application or simply referred to as an“app”), which for example may be stored in the memory 202 and executable by the at least one processor 204 to perform the functions/operations as described herein according to various embodiments. In various embodiments, the tensor block transmission module 208 may be configured to transmit the plurality of randomized tensor blocks to the plurality of distributed computing nodes, respectively, via a wireless signal transmitter or a transceiver (not shown) of the system 200.
[0035] In various embodiments, the system 200 corresponds to the method 100 as described hereinbefore with reference to FIG. 1, therefore, various functions or operations configured to be performed by the at least one processor 204 (or by the source computing node of the at least one processor 204) may correspond to various steps of the method 100 described hereinbefore according to various embodiments, and thus need not be repeated with respect to the system 200 for clarity and conciseness. In other words, various embodiments described herein in the context of the methods are analogously valid for the respective systems (e.g., the system 200), and vice versa.
[0036] For example, in various embodiments, the memory 202 may have stored therein the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210, which respectively correspond to various steps of the method 100 as described hereinbefore according to various embodiments, which are executable by the at least one processor 204 to perform the corresponding functions/operations as described herein.
[0037] A computing system, a controller, a microcontroller or any other system providing a processing capability may be provided according to various embodiments in the present disclosure. Such a system may be taken to include one or more processors and one or more computer-readable storage mediums. For example, the system 200 described hereinbefore may include a processor (or controller) 204 and a computer- readable storage medium (or memory) 202 which are for example used in various processing carried out therein as described herein. A memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
[0038] In various embodiments, a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof. Thus, in an embodiment, a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g., a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor). A “circuit” may also be a processor executing software, e.g., any kind of computer program, e.g., a computer program using a virtual machine code, e.g., Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a “circuit” in accordance with various alternative embodiments. Similarly, a “module” may be a portion of a system according to various embodiments in the present invention and may encompass a “circuit” as above, or may be understood to be any kind of a logic-implementing entity therefrom.
[0039] Some portions of the present disclosure are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
[0040] Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as “decomposing”, “transmitting”, “receiving”, “storing”, “generating”, “updating”,“instructing”,“computing” or the like, refer to the actions and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage, transmission or display devices.
[0041] The present specification also discloses a system (e.g., which may also be embodied as a device or an apparatus), such as the system 200, for performing the operations/functions of the method(s) described herein. Such a system may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose machines may be used with computer programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate.
[0042] In addition, the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that the individual steps of the methods described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the teachings of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the spirit or scope of the invention. It will be appreciated by a person skilled in the art that various modules described herein (e.g., the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210) may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.
[0043] Furthermore, one or more of the steps of a computer program/module or method described herein may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements the steps of the methods described herein.
[0044] In various embodiments, there is provided a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions (e.g., the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210) executable by one or more computer processors to perform a method 100 of distributed data management as described hereinbefore with reference to FIG.1. Accordingly, various computer programs or modules described herein may be stored in a computer program product receivable by a system therein, such as the system 200 as shown in FIG.2, for execution by at least one processor 204 of the system 200 to perform the required or desired functions. [0045] The software or functional modules described herein may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. Those skilled in the art will appreciate that the software or functional module(s) described herein can also be implemented as a combination of hardware and software modules.
[0046] In various embodiments, the system 200 may be realized by any computer system (e.g., desktop or portable computer system) including at least one processor and a memory, such as a computer system 300 as schematically shown in FIG. 3 as an example only and without limitation. Various methods/steps or functional modules (e.g., the tensor network decomposition module 206, the tensor block transmission module 208 and/or the metadata module 210) may be implemented as software, such as a computer program being executed within the computer system 300, and instructing the computer system 300 (in particular, one or more processors therein) to conduct the methods/functions of various embodiments described herein. The computer system 300 may comprise a computer module 302, input modules, such as a keyboard 304 and a mouse 306, and a plurality of output devices such as a display 308, and a printer 310. The computer module 302 may be connected to a computer network 312 via a suitable transceiver device 314, to enable access to e.g., the Internet or other network systems such as Local Area Network (LAN) or Wide Area Network (WAN). The computer module 302 in the example may include a processor 318 for executing various instructions, a Random Access Memory (RAM) 320 and a Read Only Memory (ROM) 322. The computer module 302 may also include a number of Input/Output (I/O) interfaces, for example I/O interface 324 to the display 308, and I/O interface 326 to the keyboard 304. The components of the computer module 302 typically communicate via an interconnected bus 328 and in a manner known to the person skilled in the relevant art.
[0047] FIG. 4 depicts a schematic block diagram of a network system 400 for distributed data management according to various embodiments of the present invention. The network system 400 comprises a plurality of distributed systems (e.g., each may also be embodied as a device or an apparatus, such as a cloud server) 401 comprising a plurality of distributed computing nodes, respectively, each distributed system 401 comprising: a memory 402; and at least one processor 404 communicatively coupled to the memory 402 and comprising the corresponding distributed computing node of the plurality of distributed computing nodes; and a system 200 for distributed data management as described hereinbefore according to various embodiments, such as with reference to FIG. 2.
[0048] Accordingly, as described hereinbefore, the system 200 comprises a memory 202, and at least one processor 204 communicatively coupled to the memory 202 and configured to: decompose, at a source computing node of the at least one processor 204, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition; transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes (of the plurality of distributed systems 401, respectively), respectively; and store, at the memory 202 associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block.
[0049] In various embodiments, the plurality of distributed computing nodes (of the plurality of distributed systems 401, respectively) may thus receive the plurality of randomized tensor blocks, respectively, from the source computing node (of the system 200).
[0050] It will be appreciated by a person skilled in the art that the terminology used herein is for the purpose of describing various embodiments only and is not intended to be limiting of the present invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0051] Any reference to an element or a feature herein using a designation such as “first,”“second,” and so forth does not limit the quantity or order of such elements or features. For example, such designations are used herein as a convenient method of distinguishing between two or more elements or instances of an element. Thus, a reference to first and second elements does not mean that only two elements can be employed, or that the first element must precede the second element. In addition, a phrase referring to“at least one of” a list of items refers to any single item therein or any combination of two or more items therein.
[0052] In order that the present invention may be readily understood and put into practical effect, various example embodiments of the present invention will be described hereinafter by way of examples only and not limitations. It will be appreciated by a person skilled in the art that the present invention may, however, be embodied in various different forms or configurations and should not be construed as limited to the example embodiments set forth hereinafter. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present invention to those skilled in the art.
[0053] As described in the background, security is an important issue for organizations and enterprises to outsource data storage, sharing, and computation on clouds/fogs. However, data encryption is complicated in terms of the key management and distribution, and existing secure computation techniques are expensive in terms of computational/communication cost, therefore do not scale to big data computation.
[0054] For example, the development of cutting-edge AI systems requires large amounts of data collected from sensors and IoT devices to achieve high performance in many tasks ranging from business decision-making to personalized services for end customers. Big data requires storage and processing power beyond traditional computing resources, therefore enterprises and government bodies undergoing digital transformation often need to extend their computing capability to the cloud and mobile environments. However, big data may contain sensitive information that can be exploited once placed in public cloud resources. For example, videos taken by a network of surveillance cameras may contain personal information such as individuals’ locations and preferences. As has been reported in the art, the growing attack surfaces from digital transformation call for simpler data-security solutions such as encryption and access control. However, classical cybersecurity solutions such as end-point security, network security, and digital vaults are not scalable and not cost-effective for protecting data privacy and security.
[0055] By way of examples, a number of conventional state-of-the-art privacy- preserving big data storage, communication, computation, and sharing techniques will now be described below. [0056] Encryption is a proven technique to protect confidential data by encoding the information in such a way that only authorized parties with the decryption key can decipher it. Encryption may be susceptible to side-channel attacks and many existing public key encryption schemes may be susceptible to quantum-computing attackers in future. Big data encryption is complicated in terms of the key management and distribution, and the re-encryption of large amount of data is a bottleneck to scalability of encryption technique. Therefore, commercial applications may encrypt only confidential data at much smaller scale on clouds, such as relational databases which contain customer and proprietary information. Furthermore, encrypted data computation such as homomorphic encryption may involve very high computational complexity which can incur up to several orders of performance overhead, hence may make it impractical for big data processing.
[0057] Classical Secure Multi-Party Computation (SMPC) may involve multiple rounds of communications and computation using the secret shares or random splitting of a piece of data among multiple servers to securely compute a function (e.g., addition and multiplication). The security model may require the servers to be controlled by mutually-untrusted parties to achieve information-theoretical security. Existing schemes such as garbled circuit may require oblivious transfer and symmetric cryptographic operations in the online phase, and secret sharing scheme may require high round/communication complexity and may only be practical/efficient enough with high-speed networking such as LAN connections. These existing techniques are typically based on modular arithmetic and work on discrete value such as integers and fixed-point representations. Protocols and algorithms have to be redeveloped for continuous value such as floating-point arithmetic, hence may make such classical techniques inefficient and not scalable for modern big data and machine learning applications.
[0058] Hardware enclave or secure enclave is hardware-enforced isolated execution environment (or processor with security support) that allow general-purpose computing on confidential or sensitive data, that is, encrypted data can be decrypted inside the hardware enclaves and computation can be done on the plaintext. For example, INTEL’s SGX and ARM’s TrustZone are paving the way towards realizing hardware enclaves for various applications. Nonetheless, hardware enclaves are still costly and experimental at the time being, most of the processors such as cloud resources, commercial workstations and AI chips nowadays do not have such high- level of security support. Furthermore, hardware enclaves may become the single point of attack failure, and various example embodiments of the present invention note that a new paradigm of distributed trust is needed to ensure the privacy of big data distributed applications, such as secure multi-party computation.
[0059] Data anonymization techniques such as data perturbation and differential privacy inject noise into the data to prevent privacy leakage. However, it may be difficult to control the noise threshold in order to balance usability and privacy of confidential or sensitive data. These transformation techniques are not reversible, therefore result in information loss. Therefore, data anonymization may not provide very effective solution for big data storage, communication, sharing, and computation on public cloud resources.
[0060] Data splitting techniques may partition data into blocks at byte, attribute, or semantic level. Data splitting may be used in the industry to provide layered data protection. However, data splitting releases true values and may require centralized server to keep track of the splitting criterion to ensure that each block does not leak privacy.
[0061] Tensor network computing is a well-established technique among the numerical community. The technique provides unprecedented large-scale scientific computing with performance comparable to competing techniques, such as sparse-grid methods. Tensor network represents functions in a sparsely-interconnected low-order core tensors and factor matrices and the operators by distributed tensor network operations. Tensor network was first discovered in quantum physics in the 1990s, physicists made the first attempt to capture and model the multi-scale interactions among the entangled quantum particles in a parsimonious manner and simulate how they evolve over time using a set of dynamical equations. Tensor network was then independently rediscovered in the 2000s by the numerical community and has found wide applications ranging from scientific computing to electronic design automation. Tensor decomposition, as a multidimensional generalization of matrix decomposition, is a decades-old mathematical technique in multiway analysis since the 1960s. Tensor techniques have been applied for signal processing such as blind source separation and multimodal data fusion to machine learning, such as model compression and learning latent variable models. Tensor network computing has been applied in big data processing due to its ability to model wide variety of data, such as graphical, tabular, discrete, and continuous data; algorithms to cater for different data quality/veracity or missing data; provide real-time analytics for high data velocity such as streaming analytics; and able to capture the complex correlation structure in data with large volume and generate valuable insights for many big data distributed applications.
[0062] Big data generated from sensor networks or Internet-of-Things may facilitate machine learning, in particular deep learning, in order to train cutting-edge intelligent systems for real-time decision making and precision analytics. However, big data may contain proprietary information or personal information such as location, health, emotion, and preference information of individuals which requires proper encryption and access control to protect users’ privacy. Symmetric key encryption works by adding entropy/disorderliness into data using encryption algorithms and pseudo-random number generator so that unauthorized users cannot find pattern from the ciphertext and decipher them, however, higher computational cost is usually incurred with added functionality such as ordered operations (addition/multiplication) in homomorphic encryption and asymmetric keys in public key encryption. Encryption suffers from complicated key management and distribution especially when organizations or enterprises nowadays are undergoing digital transformation to complex computing environments, such as multi-/hybrid-cloud and mobile environments. The field of SMPC originates from Yao's garbled circuit in 1982 where untrusted parties jointly compute a function without disclosing their private input. SMPC has evolved and adopts distributed trust paradigm in recent years given the complex computing environments, increasing attack surfaces, and recurring security breaches. The secret shares are distributed among multiple computing nodes in order to be information-theoretically secure, that is, secure against adversary with unbounded computational resources. SMPC computing primitives include secret sharing, garbled circuit, and homomorphic encryption. The supported secure operations are arithmetic, boolean, comparison, and bitwise operations, and other secure building blocks that are routinely being used in SMPC are oblivious transfer, commitment scheme, and zero- knowledge proof. It is known that fully homomorphic encryption suffers from very high computational complexity, making it impractical to compute complex functions during operational deployment; secret sharing and garbled circuit need many rounds of communications and therefore requires low-latency networks to operate efficiently. Furthermore, garbled circuit involves symmetric encryption during the online phase. The communication complexity of existing practical SMPC protocols may incur runtime delay from an order of magnitude using LAN setting to several orders using WAN setting compared to plaintext processing.
[0063] Various example embodiments note that the quest for scalability calls for innovative data-security solutions which not only simplify privacy management and secure operations but also provide seamless integration between privacy-preserving big data storage/communication and computation/sharing. In this regard, various example embodiments note that a fundamental change in the secure computation paradigm is required: from classical encryption/SMPC techniques, which scramble at the data-entry level, to scrambling at the data-chunk/data-block level based on distributed tensor network representations and distributed computation. In contrast to classical encryption and SMPC techniques, which are based on modular arithmetic and work on fixed-point representations, tensor networks naturally support both floating-point and fixed-point arithmetic/operations.
[0064] Accordingly, various example embodiments provide a randomized tensor network decomposition technique (e.g., corresponding to the method of distributed data management as described hereinbefore according to various embodiments) to efficiently decompose big data into fragments (smaller blocks of tensor) with partial information (e.g., corresponding to the plurality of randomized tensor blocks as described hereinbefore according to various embodiments) that are randomized, un- linkable, and not interpretable. In various example embodiments, the above-mentioned partial information may refer to mathematical, structural and/or intrinsic information in each of the randomized tensor blocks after tensor network decomposition being partial, and only with all of the randomized tensor blocks that correspond to a particular data (i.e., the original data), a reconstructed data may be produced corresponding to that particular data. As such, various example embodiments provide a randomized information dispersal method or algorithm. In various example embodiments, these fragments (which may also be referred to herein as distributed tensor network representations or simply tensor blocks) may be distributed among multiple computing nodes (e.g., virtual instances, devices, servers, clouds and so on) controlled by non- colluding parties or one party with multiple authentication factors to provide distributed trust. Furthermore, according to various example embodiments, the fragments may be protected by metadata privacy such that only authorized user is able to recover the original data (or original record) with the metadata that stores the fragments’ location and reconstruction algorithm(s). [0065] Various example embodiments note that the distributed tensor network representations naturally support compressed and distributed/dispersed computation, making it well-suited for big data processing. Furthermore, various example embodiments provide randomized and distributed/dispersed tensor network computation such that the fragments (or tensor blocks) are randomized before and after performing mathematical operation. Various example embodiments note that higher- order tensor decomposition is non-unique in general, and for example, the tensor blocks may be randomized during generation and after performing mathematical operations. For example, addition and multiplication operations may be performed on two pieces of data using the corresponding randomized tensor blocks with distributed tensor operations. The tensor blocks corresponding to each piece of data may first be randomized during the decomposition process. In various example embodiments, after performing arithmetic/multilinear operations using tensor network representations/blocks, the resultant tensor blocks may be compressed and randomized again using rounding algorithms. For example, sophisticated hackers would at least need to gain access to all or most of the communication routes, storage or computing nodes/servers in order to recover the original and processed information.
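The non-uniqueness referred to above can be illustrated (as a sketch only, not the rounding algorithms themselves) by a "gauge" re-randomization: inserting a random invertible matrix and its inverse between two neighbouring TT cores changes both blocks but leaves the represented data untouched, and the same idea can be applied after an arithmetic operation to re-randomize the resulting blocks:

    import numpy as np

    def rerandomize_pair(core_left, core_right, rng=None):
        # Replace G_k by G_k @ R and G_{k+1} by R^{-1} @ G_{k+1} for a random
        # (almost surely invertible) R; their contraction, and hence the
        # represented data, is unchanged while both blocks are re-randomized.
        rng = np.random.default_rng(rng)
        r = core_left.shape[-1]
        R = rng.standard_normal((r, r))
        new_left = np.tensordot(core_left, R, axes=([-1], [0]))
        new_right = np.tensordot(np.linalg.inv(R), core_right, axes=([-1], [0]))
        return new_left, new_right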
[0066] Various example embodiments also provide an incremental update scheme or technique of the randomized tensor network representations to cater for real-time streaming data. Furthermore, various example embodiments also provide conversion to-and-fro and operations between the tensor network representations and classical secret-sharing scheme to increase the range of supported secure operations. As the mathematics behind tensor network decomposition techniques that have been well- studied for big data processing (i.e., tensor network computing), the randomized tensor network decomposition algorithms or methods according to various example embodiments is able to decompose and process various kind of data structures, such as tabular, graphical, discrete, or continuous data (e.g., relational databases, graphical databases, structured, unstructured, and semi-structured databases), and pre-process big data for data integration and cleaning, whereby the tensor representations can be updated incrementally or dynamically. Furthermore, advantageously, the randomized tensor network decomposition method according to various example embodiments can be easily integrated into existing computing platforms, environments, and processes (e.g., mobile-cloud environments). [0067] An example implementation framework according to various example embodiments of the present invention will now be described. Various example embodiments make use of the distributed storage and processing of tensor network representations to seamlessly provide privacy-preserving big data storage, communication, computation, and sharing. Privacy and security of distributed/dispersed tensor network representations and computation can be enhanced significantly within the multi-party computation setting by distributing the tensor blocks (or distributed tensor network representations) to different distributed computing nodes (e.g., virtual instances or servers (e.g., cloud servers)) with metadata privacy. In various example embodiments, access control of the fragments (tensor blocks) of the tensor network representations is given to non-colluding parties or one party with different authentication factors on different portions of the fragments for providing distributed trust for data protection. Various example embodiments may be combined with traditional data-security technologies for data-secure implementations, such as data anonymization, differential privacy, data splitting, and encryption to provide layered protection, perform the conversion to-and-fro and/or operations between distributed tensor network and classical secure multi-party computation or secret-sharing scheme or perform computation with the aid of secure-enclave technology to increase the flexibility of secure computing circuits or functionality and computational or communication efficiency, implement digital signatures and hashing or blockchain technology to authenticate the randomized tensor blocks and ensure data availability and integrity, combine with MAC (message authentication code) and digital signatures to provide verifiable computation that is secure against malicious parties, and to compute zero-knowledge proof. In various example embodiments, machine- learning models are compressed and trained in tensor network representations with differential privacy, whereby both the model training and inference may be performed with distributed/dispersed tensor network computation.
[0068] In various example embodiments, the metadata of the data fragments contains the fragments’ locations that correspond to a particular record and the reconstruction algorithms to recover the record. For example, metadata stored at a source computing node may include the IP addresses of the recipient nodes (i.e., the distributed computing nodes to which the randomized tensor blocks have been transmitted, respectively); the filename of each randomized tensor block; the storage format; the tensor network structure; the filename of the original data; the location of each tensor block (which is assigned randomly by the source computing node); and the identity information of each tensor block (which is assigned or generated in an anonymized manner by the source computing node), such that it does not reveal the filename of the original data.
[0069] In various example embodiments, metadata privacy and security may be achieved based on classical data-security technologies, such as encryption or secret sharing schemes. Both constrained and unconstrained optimization techniques may be used to speed up and compress the distributed tensor network decomposition/computation for various big data applications. The tensor network decomposition may be performed in plaintext, encrypted data computation, or with data perturbation techniques. In various example embodiments, randomization may be achieved with randomization of the hyperparameters during tensor network decomposition, initialization, and computation. Hyperparameters may refer to parameters that are user-defined and independent of the data. For example, the seed of initialization of the tensor blocks, the updating/learning rate of each entry in the tensor blocks, constraints such as smoothness and sparseness parameters, and data sampling process may be randomized during tensor network decomposition.
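By way of a toy sketch of such hyperparameter randomization (not a prescribed procedure), the seed, the initialization, and the per-entry updating/learning rates of a simple gradient-descent factorization may all be drawn at random, so repeated runs yield different factor blocks for the same data; the sizes, rank, rate range, and iteration count below are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng()                    # randomized seed
    X = rng.standard_normal((20, 15))                # placeholder data matrix
    R = 5                                            # assumed rank
    A = 0.1 * rng.standard_normal((20, R))           # randomized initialization
    B = 0.1 * rng.standard_normal((R, 15))
    lr_A = rng.uniform(0.001, 0.01, size=A.shape)    # randomized per-entry updating rates
    lr_B = rng.uniform(0.001, 0.01, size=B.shape)
    for _ in range(500):
        E = A @ B - X                                # residual of the rank-R fit
        A = A - lr_A * (E @ B.T)                     # per-entry randomized gradient steps
        B = B - lr_B * (A.T @ E)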
[0070] According to various example embodiments, decomposing the original data randomly into the plurality of randomized tensor blocks based on tensor network decomposition may refer to any randomization (or any randomization technique) in relation to the tensor network decomposition such that the plurality of randomized tensor blocks are obtained. For example, randomization in relation to the tensor network decomposition may include randomized initialization, randomized hyperparameter selection, randomized sampling, randomized updating rate (or learning rate) of each entry in the tensor blocks, randomized mapping algorithms, and so on. For example, various example embodiments note that higher-order tensor network decompositions, such as Tucker and Tensor Train, are in general non-unique, and hence, various techniques for performing randomization in relation to the tensor network decomposition to obtain a plurality of randomized tensor blocks are within the scope of the present invention. Canonical Polyadic (CP) decomposition has a uniqueness guarantee which may be very useful for physical interpretability. Randomized mapping or projection may utilize a projection matrix, such as a Gaussian, Rademacher, or random orthonormal matrix, to project the data tensor to a smaller size in relation to (e.g., before executing) the tensor network decomposition, as illustrated by the sketch below. Randomized sampling techniques, such as fiber subset selection or tensor cross approximation, may choose a small subset of tensor fibers that approximates the entire data tensor well for tensor network decomposition. Existing randomized mapping/projection and sampling algorithms may be utilized for big data reduction to fit the data tensor into memory for tensor network decomposition; therefore, the randomized tensor blocks may be compressed with lossy reconstruction accuracy. Various example embodiments adapt these randomization techniques or algorithms for near-lossless reconstruction, e.g., using residual coding. Optimization algorithms, such as stochastic gradient descent and evolutionary computation for tensor network decomposition, are able to approximate big data to arbitrary reconstruction accuracy but may suffer from slow convergence. For example, the randomization can come from randomized initialization or a randomized updating rate or learning rate of each entry in the tensor blocks.
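The following is a minimal sketch, under assumed function and parameter names, of one common way such a Gaussian randomized mapping/projection can be applied to an unfolding of the data tensor before decomposition (in the spirit of randomized range-finding); it is not the specific algorithm of any figure herein.

```python
import numpy as np

def randomized_projection(data_tensor, sketch_size, rng=None):
    """Project the mode-1 unfolding of a data tensor onto a smaller random
    subspace so that the subsequent decomposition fits into memory."""
    rng = np.random.default_rng() if rng is None else rng
    unfolding = data_tensor.reshape(data_tensor.shape[0], -1)        # mode-1 unfolding
    omega = rng.standard_normal((unfolding.shape[1], sketch_size))   # Gaussian test matrix
    sketch = unfolding @ omega                                       # reduced-size sketch
    q, _ = np.linalg.qr(sketch)                                      # orthonormal basis of the range
    return q, q.T @ unfolding                                        # basis and compressed data
```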
[0071] Various example embodiments provide a randomization method or algorithm (or randomized tensor network decomposition) based on a perturbation technique embedded within the tensor network decomposition. In this regard, the randomization method has been found to compress data efficiently with lossy or near-lossless accuracy. In particular, various example embodiments randomly distribute the structural information among the tensor blocks during tensor network decomposition using perturbation. For example, the Higher-Order Orthogonal Iteration (HOOI) for Tucker decomposition can be modified to include a perturbation vector to randomly distribute the singular values of the SVD amongst the tensor blocks (e.g., core tensors and factor matrices). It will be appreciated by a person skilled in the art that the above-mentioned perturbation technique can also be applied or extended to other tensor network structures, such as but not limited to, extended tensor train, quantized tensor train-tucker, or tensor ring decomposition, which are closely related to the tensor train decomposition described herein according to various example embodiments. Accordingly, the resultant tensor blocks may be randomized due to the sensitivity of SVD subject to perturbation. For example, an example randomized tensor network decomposition as will be described later below with reference to FIG. 13 (Algorithm 1) may be adapted for missing data imputation by replacing the SVD with CUR decomposition, which utilizes fiber subset selection for matrix decomposition. In various example embodiments, other low-rank matrix factorization algorithms may be employed in the tensor network decomposition (e.g., tensor-train decomposition), such as randomized SVD based on the randomized projection method, Robust Principal Component Analysis, Non-Negative Matrix Factorization, Sparse Component Analysis, Independent Component Analysis, and so on. In various example embodiments, SVD is employed as it has been found to be much more efficient.
[0072] Various example embodiments may be implemented at filesystem, database, or application levels of existing software architecture. For example, various example embodiments may operate on the byte, attribute, or semantic level of data. The data may be compressed or approximated using tensor network with lossy or lossless accuracy. Near-lossless data accuracy may be achieved with tensor network lossy compression and residual coding. Further compression of tensor network may be performed with existing codec, such as dictionary-based compression, run-length encoding, and arithmetic coding.
[0073] Accordingly, according to various example embodiments, tensor networks decompose big data (or big and complex data) into fragments (a plurality of randomized tensor blocks) with partial information which are not recognizable and not interpretable, and the fragments may then be communicated and stored on a single device/cloud (e.g., multiple virtual instances in the single device) or multiple devices/clouds with metadata privacy. Accordingly, various example embodiments advantageously simplify encryption key management and distribution by providing distributed trust for big data applications in today's complex mobile-cloud environments. For example, based on the distributed data management method according to various example embodiments, enterprises may share data by transferring the metadata and data fragments through secure communication channels without exchanging decryption keys or performing re-encryption. Accordingly, the distributed data management method according to various example embodiments may advantageously be applied to big data decomposition and compression, and inherently supports compressed and distributed computation. Therefore, big data processing is advantageously simplified by removing the need to decompress or preprocess the data, compared to classical privacy-preserving computation techniques. Furthermore, the distributed nature of the distributed data management method according to various example embodiments naturally supports secure multi-party computation based on fixed-point or floating-point arithmetic and reduces the communication overhead between multiple computing nodes. For example, the distributed data management method according to various example embodiments provides layered protection against sophisticated hackers, who would face a sea of big data when searching for the fragments that correspond to a particular record. The distributed data management method may be combined with existing privacy-preserving techniques and seamlessly integrated into existing computing environments, platforms, or processes.
[0074] By way of examples only and without limitations, FIGs.5A and 5B depict tables and diagrams systematically comparing the distributed data management method (e.g., the tensor network decomposition method, which may also be referred to as the big data or tensor shredding method) according to various example embodiments of the present invention against existing data-security solutions based on technical parameters, and more particularly, comparative analysis in relation to private storage/sharing. Note that these techniques are not mutually exclusive with each other, but may be combined to provide layered protection.
[0075] Tensor decomposition is a well-established technique in the fields of signal processing and machine learning. Tensor decomposition or tensor networks decompose higher-order tensors into sparsely-interconnected small-scale factor matrices and/or low-order core tensors. These low-order core tensors may be referred to as "components", "blocks", "factors", or "cores", which encode the intrinsic or latent information of the original data. The tensor techniques may be used on different kinds of data (e.g., tabular or structured data, graphs/networks, continuous and discrete data, semi-structured, and unstructured data) and have emerged as a promising technique for big data processing, compression, and analytics. A key reason is their ability to capture the complicated correlation structure of big data/models. By way of examples only and without limitations, FIGs. 6A, 6B and 7 illustrate a few examples of basic tensor network models. For example, the nodes shown in FIG. 7 may correspond to smaller tensor blocks, the edges may correspond to the dimensionality of each block, the connections may represent particular mathematical operations needed for data reconstruction, and the number of free edges may correspond to the number of dimensions in the original tensor/data block.
[0076] Tensor network decomposition (or factorization or approximation) comprises matrix decomposition techniques based on basic tensor network formats such as those shown in FIGs. 6A, 6B and 7. In particular, FIGs. 6A and 6B illustrate different tensor network representations, including Canonical Polyadic (CP), Tucker Decomposition (TD), Hierarchical Tucker (HT), and Tensor Train (TT). In FIGs. 6A and 6B, {I, J, K, L} and Rk refer to the data modes and ranks, respectively. FIG. 7 illustrates tensor networks with different network topologies using graphical representations, including CP, TD, HT, and TT. Ik and Rk refer to the data modes and ranks, respectively. For example, there may be hybrid tensor network formats that combine two or more basic formats into one tensor network representation, generalized tensor network decompositions that involve other mathematical operations between the tensor blocks, and more sophisticated or high-dimensional formats, such as Matrix Product States (MPS), Projected Entangled Pair States (PEPS), and the Multi-Scale Entanglement Renormalization Ansatz (MERA). In the field of large-scale scientific computing, tensor network representations are commonly used to model the large number of parameters and their complex variations described by the dynamic equations, and to compute, using distributed tensor network operations, how the interactions between the subsystems evolve over time. The ability to perform compressed distributed computation using tensor networks has dramatically improved the computational efficiency and memory requirements in scientific computing as well as in machine learning models, without compromising the simulation or model performance.
[0077] FIG. 8 shows an example of distributed/dispersed computation that can be performed with distributed tensor network operations, by way of an example only and without limitations. In particular, FIG. 8 illustrates that multilinear operations, such as addition, multiplication, matrix-by-vector/matrix multiplication, and so on (e.g., various multilinear operations disclosed in "Lee, Namgil, and Andrzej Cichocki, Fundamental tensor operations for large-scale data analysis using tensor network formats, Multidimensional Systems and Signal Processing 29.3 (2018): 921-960"), can be performed in a distributed/dispersed manner to enhance efficiency and privacy, according to various example embodiments of the present invention. For example, information may be dispersed throughout the storage, communication, and computation, hence ensuring the data privacy of the original inputs without sacrificing efficiency.
[0078] FIG. 9 depicts an example distributed data management method in relation to big data at-rest and in-transit security according to various example embodiments of the present invention. A data owner may decompose (or shred or approximate) their data using tensor network decomposition and distribute the smaller tensor blocks to multiple clouds, hybrid clouds, multiple virtual instances of a single cloud, servers, or devices. The decomposition (or shredding) process may be performed on plaintext, anonymized/perturbed data, or encrypted data. The tensor blocks may be further compressed using existing codecs before or after distribution to multiple storage points. According to various example embodiments, there are a number of ways to realize the secure multi-party computation setting to provide distributed trust, depending on the security model. As a first example, the data owner may store the tensor blocks on multi-cloud or hybrid-cloud environments. As a second example, the data owner may store the tensor blocks on multiple virtual instances with different authentication factors; this may potentially leak the data privacy to the cloud administrator but still resists hacking by removing the single point of failure. As a third example, an organization may give different sub-organizations access to different portions of the tensor blocks that correspond to a particular record (or particular data), and this may be implemented in a single cloud, multiple clouds, or hybrid clouds. A data owner may also distribute the tensor blocks to multiple devices and mobile-cloud environments. In various example embodiments, the metadata of the data fragments corresponding to a particular record may be protected by encryption or a secret sharing scheme. In various example embodiments, the metadata stores the fragments' locations and the reconstruction algorithm in order to recover the original data. Following the mathematical formulations in tensor networks (e.g., Equation 1 as will be described later below), according to various example embodiments, one (e.g., a data owner) can recover a particular data (e.g., an original data or information or a particular record) using the randomized tensor blocks that correspond to that particular data. For example, the reconstructed data accuracy can be lossy, lossless, or near-lossless. The data owner or organization may first retrieve the metadata from the storage points, locate and download the data fragments/tensor blocks from the clouds, servers, or devices, and reconstruct the original data using the reconstruction algorithm stored in the metadata. In various example embodiments, the example distributed data management method may be combined with existing privacy-preserving technologies, such as but not limited to, anonymization, secret sharing, encryption, and secure enclave, to provide layered protection and enhance the functionality or efficiency.
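Purely as an illustrative sketch of the owner-side workflow just described (decompose, scatter the blocks, keep the metadata), and not as a prescribed implementation: the `upload` stub and all names below are hypothetical placeholders, and the decomposition routine is passed in (e.g., the randomized TT-SVD sketched later below).

```python
import numpy as np

def upload(endpoint: str, block: np.ndarray) -> str:
    """Placeholder storage client: in practice this would transmit the block to a
    cloud/virtual instance/device and return its anonymized filename."""
    return f"{abs(hash(block.tobytes())) % 10**8:08x}.blk"

def shred_and_distribute(data, endpoints, decompose):
    """Decompose the data into randomized tensor blocks, send one block per
    storage point, and keep the (to-be-encrypted) metadata with the owner."""
    blocks = decompose(data)                              # plurality of randomized tensor blocks
    metadata = {"tensor_network_structure": "TT", "fragments": []}
    for block, endpoint in zip(blocks, endpoints):
        name = upload(endpoint, block)
        metadata["fragments"].append(
            {"endpoint": endpoint, "block_filename": name, "shape": list(block.shape)}
        )
    return metadata
```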
[0079] FIG.10 depicts an example distributed data management method in relation to secure big data sharing according to various example embodiments of the present invention. Data or content owner may instruct the clouds or devices that store the fragments corresponding to a particular data (e.g., an original data or a particular record) to give access to an intended user to share that particular data. The data or content owner may also share the metadata corresponding to that particular data to the intended user so that the user can reconstruct that particular data using the metadata and data fragments associated with that particular data. All the communication may be performed on multiple secure communication channels to provide distributed trust. Compression and granular data access control on the dataset or database can be simultaneously achieved with the example distributed data management method according to various example embodiments.
[0080] FIG.11 depicts an example distributed data management method in relation to privacy-preserving big data computation according to various example embodiments of the present invention. Data or content owner may instruct the clouds, servers, or devices that contain the tensor blocks (or data fragments) corresponding to a particular data (e.g., an original data or a particular record) to perform secure multi-party computation based on distributed tensor network representations and operations. In various example embodiments, the secure computation may be combined with existing privacy-preserving techniques, such as secret sharing, encryption, and secure enclave. Each cloud, server, or device may contain and communicate only partial information (i.e., the respective randomized tensor block stored thereat) and therefore hackers would at least have to gain access to multiple routes, storage, and computing nodes in order to reconstruct that particular data. The data owner may also send incremental updates to the storage/computing nodes to update the tensor blocks using distributed/dispersed tensor network computation to ensure the updated data remains compressed.
[0081] FIG. 12 depicts an example distributed data management method in relation to secure multi-party computation according to various example embodiments of the present invention. Multiple data or content owners may instruct the clouds that contain the data fragments (or tensor blocks) corresponding to respective records to perform secure multi-party computation based on distributed tensor network representations and computation. Each data owner has control over different portions of the shared fragments (i.e., the respective randomized tensor block stored thereat) to ensure privacy and fairness of the joint computation. The distributed tensor blocks of the computed function may be retrieved by the data or content owners. According to various example embodiments, the example distributed data management method may be combined with secret sharing, encryption, and/or secure enclave to increase the functionality and improve the computational or communication efficiency. [0082] As more enterprises undergo digital transformation, cloud computing becomes inevitable for their daily commercial operations. For example, sharing of data collected between multiple parties increases mutual benefits (e.g., banks want to combine their data for fraud detection, hospitals want to increase the size of their databases for more accurate diagnoses or predictions, and so on), but this also incurs privacy-preservation issues. In this regard, the distributed data management method according to various example embodiments advantageously provides efficient and secure big data sharing, secure multi-party computation, and scalability for privacy-preserving big data computation. Accordingly, for example, the distributed data management method according to various example embodiments can help to secure big data applications such as data warehouses (e.g., data cleaning and integration), databases/database queries, operations and analytics, and filesystem privacy for distributed software applications on clouds, fogs, edges, and devices, to facilitate digital transformation and digital data sharing within and across enterprises, ranging from healthcare to smart manufacturing and smart cities applications. Furthermore, the distributed data management method according to various example embodiments can be used for compressed and private computation of machine learning models and large-scale numerical computing.
[0083] Although distributed tensor network representations and distributed computation have been used in signal processing and machine learning for dimensionality reduction and large-scale optimization, various example embodiments note that the potential of distributed tensor networks for big data privacy preservation has not been found to have been considered in the art. In this regard, various example embodiments note that tensor network representations are mathematically non-unique, un-linkable, and un-interpretable; therefore, the distributed representations naturally support a range of multilinear operations for compressed and distributed computation. Accordingly, various example embodiments provide randomized algorithms (which may also be referred to as randomized tensor network decomposition) to randomly decompose big data into randomized tensor network representations (a plurality of randomized tensor blocks) and analyse the privacy leakage of distributed tensor operations. The computational and communication complexity are benchmarked against existing secure computation techniques. According to various example embodiments, the tensor representations may be distributed on multiple clouds/fogs or servers/devices with metadata privacy, so as to provide both distributed trust and management to seamlessly secure big data storage, communication, sharing, and computation.
[0084] For better understanding of the present invention and without limitation or loss of generality, the distributed data management method according to various example embodiments of the present invention will now be described with respect to the tensor network decomposition being tensor train (TT) decomposition based on singular value decomposition (SVD) (which may herein be referred to as TT-SVD decomposition) and a secret-sharing scheme or technique, unless stated otherwise. In particular, with the impressive track record of distributed tensor networks in large-scale scientific computing and big data analytics, various example embodiments provide an example secret-sharing scheme or technique based on tensor networks and investigate its feasibility for privacy-preserving big data distributed applications. Accordingly, various example embodiments provide:
° an arithmetic secret-sharing scheme based on distributed tensor network representations and distributed operations. It is also shown how to randomly generate the secret shares based on tensor network and perform arithmetic operations securely in multi-party computation setting.
° analyse the information/privacy leakage of the secret sharing scheme based on tensor networks, and carry out cryptanalysis to verify the proposed secret-sharing scheme for different kinds of data types or structures.
° show how to convert securely to-and-fro and operate between the example secret-sharing scheme according to various example embodiments and the classical additive secret-sharing scheme.
Related Work
[0085] The state-of-the-art secure computation and collaborative private computation techniques for privacy-preserving machine learning will now be further discussed. On one hand, the importance of data privacy cannot be overemphasized, on the other hand, the trained machine learning models are important assets for enterprises to gain competitive advantages. Furthermore, the models may also leak sensitive information of the training data, therefore model privacy is also carefully studied in the literature. Recent surveys on the privacy and security issues of machine learning have been disclosed. The privacy issues are exacerbated in deep learning systems because the model training involves an enormous amount of data that often contains sensitive information. There are broadly three privacy attacks on machine learning models: (1) membership attacks to determine whether particular data point is in the training dataset; (2) training data extraction by model inversion techniques; (3) model parameters extraction using the input and corresponding output.
[0086] Secure Multi-Party Computation (SMPC). Homomorphic encryption schemes may provide the strongest privacy for computation on encrypted data using third-party servers such as clouds; however, they usually incur high computational and storage overhead and require a trusted authority to generate and distribute the public/private keys for all the parties. For example, SPDZ and its recent developments (e.g., MASCOT) are popular SMPC schemes for secure computation due to their efficiency and security guarantee against malicious adversaries. All these schemes have to be carefully adapted to machine learning because machine learning involves multi-step sequential computation during the inference stage and iterative computation during the training stage. In deep learning, the state-of-the-art models are usually much bigger (e.g., billion to trillion parameters) with multiple/many layers of matrix multiplication/convolution and non-linear activation steps. Arithmetic secret sharing and homomorphic encryption are typically used in executing arithmetic operations such as matrix multiplication/convolution, whereas Garbled circuits are used in executing Boolean operations such as rectified linear units, max pooling, and their derivatives. Garbled circuits incur a multiplicative overhead proportional to the security parameter in communication and require oblivious transfer protocols, as well as symmetric encryption during the online phase. Share conversion protocols may be used in deep learning to convert from an arithmetic encoding to a Boolean encoding and vice-versa, which is another source of inefficiency. Currently, no practical privacy-preserving computation for the training of state-of-the-art deep learning models has been found. For model inference, there is some hope in terms of scalability because model training may usually be done in a federated and differentially-private manner for horizontally-partitioned data, whereas model inference may typically be done in a central manner with no differential privacy on the data input in order to obtain accurate predictions. In general, machine learning models do not require high-precision computation; therefore, efficient and secure computation can be achieved with quantization/binarization, sparsification/model pruning, and parallel distributed computation. In particular, deep learning models can be pre-processed according to the SMPC protocols to render significantly faster inference. Several surveys of machine learning and deep learning techniques with privacy preservation have been disclosed that systematically compare different SMPC protocols.
[0087] Trusted Execution Environment (TEE) or secure enclave is a secure execution environment that protects applications running inside the enclave from malicious code outside, such as a compromised operating system or hypervisor. Secure enclaves have been adopted in diverse environments ranging from cloud servers, client devices, mobile and Internet-of-Things (IoT) devices, to embedded sensors to securely store and process sensitive information, such as cryptographic keys, biometric data like fingerprints and face identity information, and key management. A small codebase is typically easier to secure using static and dynamic verification tools, as well as sandboxing; therefore, it may be important to split an AI system's code into a minimal codebase running inside the enclave and code running outside in untrusted mode by leveraging cryptographic techniques. In the context of privacy-preserving machine learning, the disk, network, and memory access patterns have to be data-oblivious in order to prevent side-channel attacks that may leak a large amount of sensitive data. Additionally, a few studies have been done to improve the model training efficiency, privacy, and security using enclaves, e.g., Myelin's training on a multithreaded enclave, Chiron's distributed training on several enclaves, Slalom's GPU delegation of matrix multiplication, CalTrain's detection of poisoned and mislabeled training data that lead to runtime mispredictions, and new differentially-private and oblivious sampling algorithms that have been proposed for trusted processors. Existing commercially available hardware-enforced isolated execution environments, including INTEL's SGX and ARM's TrustZone, are performant, general-purpose CPUs that provide remote attestation to verify that the enclave is running expected code, but they have limited resources such as addressable memory, have no specialized AI accelerators such as GPUs or TPUs, and are not customizable for different services or applications. An open source framework named Keystone that is based on the RISC-V open instruction set provides a promising direction to build and instantiate customizable TEEs with a simple modular design.
[0088] Differential Privacy and Federated Learning. Differential privacy is a mathematical framework to rigorously quantify the amount of information leaked on each item of a training dataset in machine learning. The framework was initially proposed for a (central) statistical database to bound the privacy leakage of individual’s information from one-time or repeated query mechanisms. The framework was then extended to protect the (local) privacy of decentralized or distributed data in federated learning, such that the data owners only share (partial/subset of) model updates instead of their own data to the central server. Differential privacy can be applied at different stages of federated learning, e.g., individual or aggregated updates. However, recent research questions the effectiveness of differential privacy for well-trained models, especially with prior knowledge or more detailed information of the sensitive data to guide the model inversion.
[0089] Data Anonymization is perhaps the simplest low-cost solution that may be adopted for secure data sharing within and across enterprises for diverse applications, including machine learning. Data anonymization techniques cover both the removal of personally identifiable information (e.g., using hashing and masking techniques) and data randomization/perturbation techniques (e.g., random noise, permutation, and transformation). The random components or functions have to be carefully designed to preserve important information in the training dataset and ensure model performance. Systematic surveys of different privacy metrics have been proposed over the years. These privacy metrics are based on information theory, data similarity, indistinguishability measures (such as differential privacy), and others; the choice of a suitable privacy metric for a particular setting depends on the adversarial model, the data sources, the information available to compute the metric, and the properties to measure. A data anonymization approach has been disclosed to generate synthetic data that resembles the statistical distribution or behaviour observed in the original datasets using generative machine learning models, such as generative adversarial networks and computer simulations. However, these models/simulations are application-specific (i.e., they depend on the training dataset or physical models), and any analysis on the synthetic data has to be verified over the real dataset for validation.
[0090] Others. Blockchain is a distributed ledger that may record transactions in a cryptographically verifiable or tamper-proof manner. Blockchain may be used to create smart contracts between mutually-untrusted parties to automate the workflow of collaborative model training and ensure the data integrity and process immutability of SMPC and TEE. Data Capsule presents a new paradigm for static enforcement of privacy policies by deriving residual policies based on abstract interpretation; this approach provides automatic compliance checking of data privacy regulations even with heterogeneous data processing infrastructures. Machine learning has also been used to discover and classify sensitive data for enterprises to save manpower in the manual checking process while ensuring regulation compliance (e.g., Amazon Macie). Although privacy-preserving matrix and tensor decomposition techniques have been well studied in the literature, various example embodiments note that distributed tensor network representations and computation have not been proposed for privacy preservation and secure computation.
Threat Model and Security
[0091] Consider a set of clients C1, C2, ..., Cn who want to jointly train a machine learning model on their private inputs; the data can be horizontally or vertically partitioned, or secret-shared among them as part of previous computations. The secure computation may be outsourced to a set of untrusted but non-colluding servers S1, S2, ..., Sm; the clients simply need to distribute or secret-share their inputs among the servers in the initial setup phase, and the servers then proceed to securely compute and communicate using SMPC protocols. The servers can be run on different software stacks to minimize the chance that they all become vulnerable to the exploits available to malware attacks, and can be operated under different sub-organizations to minimize insider threats. For example, given the cloud scenario, the secret shares can be distributed to different cloud accounts provided by the same cloud service provider (CSP) or to different clouds run by different CSPs (e.g., multi-cloud or hybrid-cloud environments). Various example embodiments assume a semi-honest adversary A (or so-called honest-but-curious adversary) who can corrupt any subset of the clients and at most m - 1 servers at any point of time. This security definition may require that an adversary learn only the corrupted clients' inputs but nothing else about the honest clients' inputs beyond the trained model. As will be evident later, the example secret-sharing scheme based on tensor networks according to various example embodiments may not be symmetric with respect to the servers, with each server holding index-specific information. The example secret-sharing scheme based on distributed tensor networks will now be described below in further detail according to various example embodiments of the present invention. In particular, example secure re-sharing protocols and conversion to-and-fro the classical additive secret-sharing scheme are disclosed to preserve the privacy of the computed outputs, thereby enabling the proposed protocols to be arbitrarily composed for complex models. In various example embodiments, the tensor network representations satisfy the data complexity (or rank complexity) criteria discussed and theoretically verified later below to be privacy-preserving in a multi-party setting.
Secret-Sharing Scheme based on Distributed Tensor Networks
[0092] While encryption is a well-proven and highly-efficient technique for securing communications over the internet, encrypted databases may not be widely adopted because of the complicated key management and distribution problem. Furthermore, the centralized key management also makes the encrypted database approach not scalable for big data distributed applications within or across enterprises. State-of-the-art SMPC, on the other hand, cannot catch up with the fast-paced machine learning and big data applications due to high computational and communication complexity. In this regard, various example embodiments provide an example secret-sharing scheme or technique based on tensor network distributed representations and distributed operations (e.g., corresponding to the method of distributed data management as described hereinbefore according to various embodiments of the present invention) to seamlessly secure big data storage, communication, sharing, and computation. Tensor network decomposition splits a data chunk (or data block) at the semantic level; the distributed representations contain latent information and are distributed among multiple servers with metadata privacy. For example, the distributed and non-colluding computing nodes may only know the anonymized filename and format of their received tensor blocks, but each of the distributed computing nodes does not know the filename and tensor network structure of the original data. Each server has its own encryption, access control, and security mechanisms, thus providing distributed trust for data privacy protection in complex computing environments. By way of examples only and without limitation, randomization algorithms or methods will be described with reference to FIGs. 13 and 15 for decomposing a data chunk/block into randomly-distributed tensor representations, and with reference to FIG. 16 for securely updating the tensor representations. The conversion to-and-fro the classical additive secret-sharing scheme will also be described.
[0093] The success of multiway component analysis may be due to the existence of efficient algorithms for matrix and tensor decomposition and the possibility to extract components with physical meaning by imposing constraints such as sparsity, orthogonality, smoothness, and non-negativity. Various example embodiments note that higher-order tensor decomposition is typically non-unique, and each core or sub-block contains index-specific information which is un-linkable and non-interpretable. As explained hereinbefore, various example embodiments may be described with respect to tensor-train (TT) decomposition due to its flexibility for distributed multilinear operations and the possibility to convert other specific tensor models (e.g., Canonical Polyadic, Tucker, and Hierarchical-Tucker decomposition) into TT format; however, it will be appreciated by a person skilled in the art that the present invention is not limited to TT decomposition. For example, similar properties apply to the tensor chain or tensor ring (TR) format, which is a linear combination of TT formats. TR representations may be more generalized/powerful compared to TT representations with smaller ranks, whereas extended TT decomposes the TT-cores into smaller blocks. For example, low-rank approximation is very useful in tensor network computing for saving storage, communication, and computational cost with negligible loss in model accuracy. Accordingly, it will be appreciated by a person skilled in the art that the example randomized algorithms described with reference to TT representations can be easily extended to the TR and extended TT formats, and need not be described herein for clarity and conciseness.
[0094] Share Generation. In various example embodiments, the TT decomposition may be based on the original TT-SVD algorithm as disclosed in "Ivan V Oseledets. 2011. Tensor-train decomposition. SIAM Journal on Scientific Computing 33, 5 (2011), 2295-2317", the content of which being hereby incorporated by reference in its entirety for all purposes. In various example embodiments, the TT decomposition may be defined as:

A3(i1, i2, i3) = G1[i1] G2[i2] G3[i3]   (Equation 1)

[0095] A3(i1, i2, i3) is a third-order tensor with indices ik and size Ik along each dimension. Gk[ik] is a matrix of size rk-1 x rk, where r0 = r3 = 1 and the product in Equation 1 is matrix multiplication. The storage complexity of TT decomposition is O(NIR^2), where I = max(I1, I2, ..., IN) and R = max(r1, r2, ..., rN-1). The rank tuple of the TT-cores (r1, r2, ..., rN) is called the TT-rank. The TT-core Gf (i.e., the set of matrices Gf[if]) denotes the private share stored in server f. The communication cost is highly efficient if the TT-cores are compressible. TT decomposition is mathematically non-unique, and each core contains only index-specific information; therefore, the cores are un-linkable. [0096] By way of an example and without limitation, FIG. 13 shows a randomization algorithm (Algorithm 1) that presents a randomized TT-SVD algorithm that decomposes N-dimensional data into randomized secret shares. In this regard, according to various example embodiments, SVD may decompose a matrix A2(i1, i2) into left and right singular vectors, where the basis vectors are ranked by the amount of explained variation in A2(i1, i2), i.e., the so-called singular values. Mathematically, SVD may be given as A2(i1, i2) = U S V^T, where U and V are orthonormal matrices that contain the left and right singular vectors in their respective columns, and the diagonal elements of the S matrix contain the corresponding singular values. Algorithm 1 is based on the above-mentioned original TT-SVD algorithm, which performs sequential (truncated) singular value decomposition (SVD) on the full tensor AN(i1, i2, ..., iN) in order to obtain the TT decomposition.
[0097] FIG. 14 shows the graphical representation of the example randomized TT-SVD algorithm (Algorithm 1). In particular, FIG. 14 illustrates the example randomized TT-SVD algorithm for a 3rd-order tensor. The nodes represent tensors and the number of free edges represents the tensor order, e.g., a node with two edges represents a matrix. The edges that connect two nodes refer to multilinear operations that connect the two tensors. According to various example embodiments, to balance between compression (reconstruction error ≤ ε) and randomness, the maximum (randomized) perturbation vector d should be within a certain threshold based on the magnitude of each singular value, and the positive and negative sign differences of the corresponding singular vector. As an example, Algorithm 1 uses a uniformly-distributed perturbation vector d. According to various example embodiments, the perturbation is embedded in the TT-SVD algorithm to randomize the distribution of singular values between the core tensors (tensor blocks) during decomposition. tSVD refers to truncated SVD, which truncates the singular values and vectors to reduce storage size given the relative approximation error ε; the truncation parameter may be set to δ = (ε/√(N−1)) · ||AN||F for each core tensor, as derived from the original TT-SVD algorithm, in order to balance the truncation among the physical modes. Various example embodiments may be based on the perturbation effects of SVD for complex data, namely, the closeness of the singular values determines the sensitivity of the corresponding singular vectors subject to perturbation. For example, complex data usually result in SVD with closely-separated singular values; therefore, the tensor blocks generated using Algorithm 1 are highly randomized given the large-but-controlled perturbation. With modest rank complexity (e.g., > 10), the randomization can give rise to more than 10^10 sets of randomized tensor blocks, and real-life data usually result in rank complexity of more than about 100. Accordingly, without the right set of randomized tensor blocks, the original data cannot be fully recovered using Equation 1 for tensor-train reconstruction. The share regeneration can be done with the example randomized TT-SVD algorithm, carried out entirely in TT format. However, the example secret-sharing scheme is asymmetric (each party stores index-specific information); therefore, various example embodiments design or configure the re-sharing procedure such that the new shares are exchanged without any party retaining its generated new share, to prevent data reconstruction. In general, the partition of a more sophisticated tensor network structure into private and shared cores may be performed with hierarchical clustering based on pairwise network distance and randomized algorithms that minimize privacy leakage, communication, and computational cost.
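To make the preceding description concrete, the following is a minimal NumPy sketch of a perturbed sequential TT-SVD in the spirit of Algorithm 1; it is not the listing of FIG. 13 itself, and the specific perturbation (an exponent-based random split of each retained singular value between the two neighbouring cores), the truncation rule, and all parameter names are assumptions made for illustration only.

```python
import numpy as np

def randomized_tt_svd(a, eps=1e-6, pert_scale=0.5, rng=None):
    """Sequential truncated SVDs on the unfoldings of `a`, with each retained
    singular value split randomly between the two cores it connects."""
    rng = np.random.default_rng() if rng is None else rng
    dims, n = a.shape, a.ndim
    delta = eps / np.sqrt(n - 1) * np.linalg.norm(a)      # per-unfolding truncation threshold
    cores, r_prev = [], 1
    c = a.reshape(dims[0], -1)
    for k in range(n - 1):
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]     # norm of the would-be-discarded tail
        r = max(1, int(np.sum(tail > delta)))             # smallest rank meeting the threshold
        u, s, vt = u[:, :r], s[:r], vt[:r, :]
        alpha = 0.5 + pert_scale * (rng.random(r) - 0.5)  # random split: s = s**alpha * s**(1-alpha)
        cores.append((u * s**alpha).reshape(r_prev, dims[k], r))
        c = ((s**(1 - alpha))[:, None] * vt).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(c.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT-cores back into the full tensor (cf. Equation 1)."""
    full = cores[0]
    for g in cores[1:]:
        full = np.tensordot(full, g, axes=([-1], [0]))
    return full.reshape(full.shape[1:-1])
```

For a small test array, `tt_reconstruct(randomized_tt_svd(a, eps=1e-12))` would be expected to match `a` up to the chosen truncation error, while two runs with different random generators would be expected to produce visibly different cores.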
[0098] Privacy and Correctness. The correctness of the tensor network format is obvious, and the representations are compressible if the data admits a low-rank structure. According to various example embodiments, the example randomized algorithm simply splits the structural information (or correlation structure) randomly into different cores. The sensitivity of SVD decomposition subject to small perturbations is well-known for complex correlation structures, i.e., when the singular values are closely separated. Moreover, the example algorithm randomizes the decomposition by large-but-controlled perturbations that do not affect the data reconstruction accuracy. The privacy leakage is limited to the rank complexity of each index, i.e., an index that has sufficient rank complexity is privacy-preserving, whereas an index that has only zeroes in the TT-cores implies that all the values that correspond to this index are zero. The magnitude, sign, and exact position of non-zero values are not leaked even with collusion by all-except-one servers.
[0099] Tensor Network Computing naturally supports a number of multilinear operations in floating-point/fixed-point representations with minimal data-preprocessing (e.g., addition, multiplication, matrix-by-matrix/vector multiplication, and inner product), unlike classical SMPC schemes that only support limited secure operations (e.g., addition or multiplication) and have to be pre-processed every time to carry out different operations, which involves many rounds of communication in order to compute complex functions. With tensor network representations, multilinear operations can be performed in a compressed, distributed manner without the need to reconstruct the original tensor; this is a major advantage of tensor computation in overcoming the curse of dimensionality (or intermediate data explosion) for large-scale optimization problems. The TT-ranks grow with every mathematical operation and quickly become computationally prohibitive; the TT-rounding (or recompression) procedure can be implemented to reduce the TT-ranks by (1) orthogonalization of each of the tensor blocks sequentially using QR decomposition, from the rightmost tensor block to the leftmost tensor block in TT format, and (2) compression and randomization of each of the tensor blocks sequentially using SVD decomposition, from the leftmost tensor block to the rightmost tensor block in TT format, all in a distributed manner. It will be appreciated by a person skilled in the art that the example randomized TT-SVD algorithm can be easily extended to the second step of the TT-rounding procedure.
[00101] Randomized TT-rounding (e.g., based on the Ivan V Oseledets reference mentioned hereinbefore) in Algorithm 2 as shown in FIG. 15 may be used to reduce the rank complexity after performing multilinear tensor operations, such as addition, multiplication, matrix multiplication, and so on. For example, Algorithm 2 includes performing QR decomposition to orthogonalize each of the tensor blocks and performing SVD for compression of each of the tensor blocks, whereby the randomization is performed by a perturbation vector that randomly splits the singular values among the core tensors (or tensor blocks), similar to Algorithm 1. Mathematically, QR decomposition is a decomposition of a matrix into a product of an orthogonal matrix Q and an upper triangular matrix R, A2(i1, i2) = QR.
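Along the same lines, and again only as an assumed illustrative sketch rather than the FIG. 15 listing, the two sweeps of such a randomized TT-rounding might look as follows: a right-to-left QR orthogonalization, then a left-to-right truncated SVD with the same exponent-based random split used in the earlier sketch.

```python
import numpy as np

def tt_rounding(cores, eps=1e-6, pert_scale=0.5, rng=None):
    """Reduce the TT-ranks of `cores`: orthogonalize right-to-left with QR,
    then compress left-to-right with truncated SVD and a random singular-value split."""
    rng = np.random.default_rng() if rng is None else rng
    cores = [g.copy() for g in cores]
    n = len(cores)
    for k in range(n - 1, 0, -1):                             # right-to-left orthogonalization
        r1, ik, r2 = cores[k].shape
        q, r = np.linalg.qr(cores[k].reshape(r1, ik * r2).T)
        cores[k] = q.T.reshape(-1, ik, r2)                    # rows now orthonormal
        cores[k - 1] = np.tensordot(cores[k - 1], r.T, axes=([2], [0]))
    delta = eps / np.sqrt(n - 1) * np.linalg.norm(cores[0])   # whole-tensor norm now sits in core 0
    for k in range(n - 1):                                    # left-to-right compression
        r1, ik, r2 = cores[k].shape
        u, s, vt = np.linalg.svd(cores[k].reshape(r1 * ik, r2), full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
        r_new = max(1, int(np.sum(tail > delta)))
        u, s, vt = u[:, :r_new], s[:r_new], vt[:r_new, :]
        alpha = 0.5 + pert_scale * (rng.random(r_new) - 0.5)
        cores[k] = (u * s**alpha).reshape(r1, ik, r_new)
        carry = (s**(1 - alpha))[:, None] * vt                # absorbed into the next core
        cores[k + 1] = np.tensordot(carry, cores[k + 1], axes=([1], [0]))
    return cores
```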
[00101] Algorithm 3 as shown in FIG. 16 is a randomized TT incremental updating algorithm based on Algorithms 1 and 2. In various example embodiments, Algorithm 3 may first decompose the incoming data tensor into randomized tensor blocks using Algorithm 1, then pad the old and new data tensors in TT format so that their dimensions are consistent, perform addition of the old and new data tensors in a distributed manner in TT format, and subsequently perform Algorithm 2 in a distributed manner to reduce the rank complexity or storage size of the resultant randomized tensor blocks. The addition of two data tensors in TT format may be performed in a distributed manner. By way of an example only and without limitations, Z = X + Y may be performed on each TT-core as Zk[ik] = Xk[ik] ⊕ Yk[ik], where ⊕ is the direct sum, as defined in "Lee, Namgil, and Andrzej Cichocki. Fundamental tensor operations for large-scale data analysis using tensor network formats. Multidimensional Systems and Signal Processing 29.3 (2018): 921-960".
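As an assumed illustrative sketch of the core-wise direct-sum addition just referenced (with names chosen here for illustration, not taken from FIG. 16), two TT representations of tensors with identical mode sizes can be added as follows; the grown ranks would then be reduced with a rounding step such as the one sketched above.

```python
import numpy as np

def tt_add(x_cores, y_cores):
    """Z = X + Y performed core-wise: interior cores become block-diagonal
    direct sums; the first/last cores are concatenations along the rank mode."""
    n = len(x_cores)
    z_cores = []
    for k, (x, y) in enumerate(zip(x_cores, y_cores)):
        rx1, ik, rx2 = x.shape
        ry1, _, ry2 = y.shape
        if k == 0:
            z = np.concatenate([x, y], axis=2)             # (1, I1, rx2 + ry2)
        elif k == n - 1:
            z = np.concatenate([x, y], axis=0)             # (rx1 + ry1, IN, 1)
        else:
            z = np.zeros((rx1 + ry1, ik, rx2 + ry2), dtype=x.dtype)
            z[:rx1, :, :rx2] = x                           # direct sum (block diagonal)
            z[rx1:, :, rx2:] = y
        z_cores.append(z)
    return z_cores
```

Because each summand block depends only on the corresponding pair of cores, each party holding one pair can form its block locally; only the subsequent rounding sweep involves passing small coupling matrices between neighbouring parties.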
[00102] Relationship with Additive Secret-Sharing Scheme. The classical additive secret-sharing scheme is defined by splitting a secret A into m random shares A1, A2, ..., Am such that A = A1 + A2 + ... + Am, with party f holding only the share Af. In various example embodiments, the conversion from the classical scheme to the secret-sharing scheme based on TT format may be achieved by each party decomposing their individual share using the example randomized TT-SVD algorithm (i.e., Algorithm 1) and sending to the other parties the corresponding TT-cores. All parties may then add up their corresponding TT-cores and reduce the size of each core using the example randomized TT-rounding procedure (i.e., Algorithm 2). The conversion from TT format to the additive secret-sharing scheme may be performed as follows: all-except-one parties generate randomized TT-cores with rank complexity similar to their own TT-cores and pass the generated TT-cores to the corresponding party; all the TT-cores are updated using the TT-rounding procedure; the all-except-one parties then pass their updated TT-cores to the remaining party (that did not generate randomized TT-cores before) to generate its additive secret share, and the other parties reconstruct their additive secret shares using the randomized TT-cores they generated before. A tensor network representation can operate with another tensor in full or tensor network format; therefore, it is expected that the tensor network scheme can operate with the additive secret-sharing scheme to increase the range of supported secure operations.
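Building on the sketches above (randomized_tt_svd, tt_add, and tt_rounding), the additive-to-TT direction of this conversion could, under the same illustrative assumptions, be expressed as follows; the reverse direction and the inter-party exchange steps are omitted here for brevity.

```python
def additive_to_tt(additive_shares, eps=1e-6):
    """Each party TT-decomposes its own additive share locally; the parties then
    add the TT representations core-wise and jointly recompress the result."""
    tt_shares = [randomized_tt_svd(share, eps) for share in additive_shares]  # local step
    acc = tt_shares[0]
    for cores in tt_shares[1:]:
        acc = tt_add(acc, cores)          # distributed core-wise direct sum
    return tt_rounding(acc, eps)          # reduce the grown TT-ranks
```

Secure Big Data Storage, Communication, and Sharing with Metadata Privacy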
[00103] Public key infrastructure (PKI) is an established technology for securing communications over the internet. PKI uses asymmetric or public key encryption to exchange private keys and create digital signatures for sender authentication and message non-repudiation, whereas symmetric key encryption is used for data encryption. However, it is non-trivial to secure data storage and sharing on semi-trusted third parties such as clouds. Conventional encryption techniques would require the data owners to download, decrypt, and re-encrypt the requested data in case access policies change dynamically/frequently. Proxy re-encryption is a mathematical technique to seamlessly secure distributed data storage and sharing using clouds without leaking data privacy to the semi-trusted third parties. Proxy re-encryption is based on public key encryption, and the technique offloads the heavy computational cost of big data re-encryption to third-party servers; however, most algorithms used nowadays are highly susceptible to quantum-computing attacks.
[00104] Enterprises and organizations usually classify data (using human labels or machine classification) into sensitive data that requires advanced encryption and access control mechanisms to secure data storage and communication, less sensitive data that requires only anonymization techniques for sharing, and non-sensitive data, in order to ensure regulatory compliance. Encryption may be expensive in terms of key management and distribution for multiple or many users/devices, and it may require centralized management by a trusted authority to authenticate, authorize, and revoke access to prevent key leakage that may lead to a massive data breach. Data anonymization (or data randomization) techniques may result in the loss of important information, and it is difficult to control the noise threshold to balance privacy and model utility. Accordingly, various example embodiments combine the secret-sharing scheme based on distributed tensor networks with metadata privacy to seamlessly secure big data storage, communication, and sharing. Tensor decomposition is commonly used in data mining to discover patterns or relationships from billion-scale tensors; this involves highly computation-intensive operations and is usually done with a pool of GPUs/CPUs. Various example embodiments re-purpose tensor techniques for big data privacy preservation; a large dataset can be sub-divided into smaller data chunks/blocks (about 100K data entries each) for parallel distributed computation to speed up the decomposition. The advantages of distributed tensor network representations for secure data storage include information-theoretical security, compression, fine-grained data access control, updatability, and compressed computability. In contrast, existing encryption techniques may provide only computational security (i.e., not quantum-safe); additive secret-sharing schemes may incur at least 2-3 times the storage and communication cost; and most compression algorithms may require decompression to have granular access control, updatability, and computability.
[00105] Metadata may include information on the underlying data and is used in communication and database management systems. Metadata may be broadly categorized into operational, technical, and business metadata according to the specific purposes it serves. In the context of data storage, metadata may serve as the logical "map" for users to navigate through the information and data. Metadata may also help auditors to carry out system review and post-breach damage assessment. After decomposing data and distributing each core or sub-block to multiple clouds/fogs/servers/devices using the example randomized tensor algorithms according to various example embodiments, the actual metadata stays with the data owner and is appended with the location and name of each core (randomized tensor block), the tensor structure, and the reconstruction algorithms. In various example embodiments, the metadata is encrypted and password-protected to provide layered protection. The metadata of each core stored on the multi-party may include only the anonymized names, location, data structure and type, and user access rights (without the time of access). Accordingly, the example distributed data management method according to various example embodiments provides both distributed trust and distributed management for big data privacy preservation; that is, the data privacy is jointly protected by the security architectures provided by different computing platforms and environments, and hackers would have to breach the security of all routes to retrieve the original data, whereas the metadata (i.e., locations, anonymized names of each core, and access rights) can be changed at any time by the data owner according to the data sensitivity and usage. Secure data sharing can be done seamlessly by the data owners by sharing the actual metadata of each record; the user will then proceed to download the fragments from the clouds/servers and reconstruct the original records efficiently on the user side. This removes the scalability bottleneck of repeated downloading/decryption/re-encryption processes by the data owner in multi-user or many-user data sharing settings. Bulk re-encryption is not necessary because the metadata is unique down to the individual record level; any potential leakage is traceable and is also limited to the shared metadata. Accordingly, the example distributed data management method according to various example embodiments is advantageously able to reduce the pain point of the key management problem, especially with distributed, serverless computing and containerized applications.
Experiments
[00106] The privacy leakage of the example secret-sharing scheme based on distributed tensor network according to various example embodiments was analysed in various experiments performed.
[00107] Experimental Setup. The experiments were carried out using a workstation with an INTEL Xeon CPU E5-1650 v4 at 3.60 GHz and 32.0 GB RAM. The TT decomposition speed was about 5 MB per CPU and the reconstruction speed was about 50 MB per CPU. Cryptanalysis was performed on the randomized TT representations using Pearson's correlation and histogram analysis.
[00108] FIGs. 17 to 25 show randomized tensor network decompositions of various types of data, e.g., image, audio, video, sensor, graph, and textual data, according to various example embodiments of the present invention. Empirical results show that the randomized tensor blocks are un-recognizable and successfully anonymize the original data. Histogram analysis also shows that the tensor blocks' distributions are always Gaussian or Laplacian, regardless of the original data distribution. The randomized fragments are orthonormalized and may be multiplied by the randomized scaling factors before tensor network reconstruction. According to various example embodiments, it is found that data complexity may be utilized or applied to mask the data using the randomized tensor network algorithm or method according to various example embodiments of the present invention.
[00109] FIG. 26 shows the image distortion resulting from adding noise to a randomly-selected tensor block of different tensor network decompositions, according to various example embodiments of the present invention. If one of the tensor blocks is unknown (denoted as "random"), the image cannot be reconstructed. On the right-hand side, the distorted images resulting from adding noise are shown; the effect is larger if the perturbation is applied on the important/influential portion of the sub-tensor, however this information is usually unknown to the adversary. CP's distortion is larger because the format is more compact compared to other tensor networks.
[00110] FIG. 27 shows the normalized mutual information (NMI) between tensor blocks that belong to a particular image (top row), to two different images (bottom row), and to random noise ("rand"), according to various example embodiments of the present invention. The results show that they are indistinguishable from each other. NMI is a universal metric in the sense that if any other distance measure judges two random variables to be close-by, NMI will also judge them close. The NMI variation is largely attributed to the variation in the tensor blocks' value distributions; if the variation in a particular tensor block is high (i.e., its entropy is high), its NMI with another tensor block is likely to be smaller.
[00111] FIG. 28 depicts a table benchmarking the tensor network decomposition/reconstruction efficiency (in milliseconds) for different datasets. The classification model performance (top-1 accuracy) on compressed tensor blocks is compared against that on the original dataset without fine-tuning the convolutional neural network. The time needed generally increases with data size. CP decomposition takes about 4 times longer than the other tensor network decompositions. The decompression time is generally much shorter than the compression time. After decomposition, the tensor blocks are quantized to 8-bit depth. Some of the tensor blocks can be uniformly quantized, but some require non-uniform quantization using Lloyd's algorithm to reduce the image distortion, e.g., TD's core G. It can be observed that tensor networks generally retain the features for image classification without the need to retrain the model; at least half of the storage size can be saved by using tensor networks for data compression.
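The quantization step can be sketched as follows: blocks with well-behaved value distributions are quantized uniformly to 8 bits, while harder blocks would use a non-uniform Lloyd-Max codebook. The helper functions below are illustrative only; the Lloyd-Max example uses a deliberately small codebook to keep the demonstration fast.

```python
import numpy as np

def uniform_quantize(block, levels=256):
    """Uniform 8-bit quantization: store integer codes plus (offset, step) per block."""
    lo, hi = block.min(), block.max()
    step = (hi - lo) / (levels - 1)
    codes = np.round((block - lo) / step).astype(np.uint8)
    return codes, lo, step

def lloyd_max(values, levels=16, iters=20):
    """Simple 1-D Lloyd-Max (k-means) codebook for non-uniform quantization."""
    centers = np.linspace(values.min(), values.max(), levels)
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centers[k] = values[idx == k].mean()
    return centers

rng = np.random.default_rng(5)
core = rng.standard_normal((8, 64, 8)).astype(np.float32)
codes, lo, step = uniform_quantize(core)
dequant = lo + codes.astype(np.float32) * step
print("uniform 8-bit max error:", np.abs(dequant - core).max())
centers = lloyd_max(core.ravel())            # non-uniform codebook for harder blocks
print("Lloyd-Max codebook size:", centers.size)
```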
[00112] FIG. 29 shows an example architecture of an example secret-sharing scheme or method based on distributed tensor network, according to various example embodiments of the present invention (e.g., corresponding to the method of distributed data management as described hereinbefore according to various embodiments). As shown, the data may be ingested from multiple databases and pre-processed in the Hadoop parallel processing framework for various big data applications; Hadoop may ensure data availability by duplicating the data into multiple copies stored on different computing nodes. Accordingly, the method of distributed data management according to various example embodiments provides a big data shredder or dispersal service to distribute the randomized tensor blocks to multiple databases on multiple public clouds. The metadata that comprises the identity information and location information may be stored within the enterprise infrastructure.
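A hedged sketch of the dispersal ("data shredder") step in FIG. 29 is shown below: each randomized tensor block is randomly assigned to one of several public-cloud endpoints, and only the anonymized-name-to-location map is retained within the enterprise infrastructure. The endpoint names and the upload callback are hypothetical; in practice the upload would be an object-store transfer to the chosen cloud.

```python
import random
import secrets

import numpy as np

def disperse(blocks, cloud_endpoints, upload):
    """Randomly scatter randomized tensor blocks across clouds; return the owner-side map."""
    location_map = {}
    for block in blocks:
        name = secrets.token_hex(8)                  # anonymized block name
        endpoint = random.choice(cloud_endpoints)    # random assignment to a cloud database
        upload(endpoint, name, block)                # placeholder for the actual transfer
        location_map[name] = endpoint                # kept only inside the enterprise
    return location_map

# Stand-in "upload" that merely records what would have been sent.
sent = []
def fake_upload(endpoint, name, block):
    sent.append((endpoint, name, block.shape))

blocks = [np.zeros((1, 64, 8)), np.zeros((8, 64, 8)), np.zeros((8, 64, 1))]
owner_map = disperse(blocks, ["cloud-a", "cloud-b", "cloud-c"], fake_upload)
print(owner_map)
```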
[00113] Scalability is an important consideration both for the success of machine learning and for big data privacy preservation. Various example embodiments provide a distributed data management method (e.g., a secret-sharing scheme) based on distributed tensor network representations and distributed computation that is much more efficient in terms of computational and communication complexity compared to existing SMPC schemes for privacy-preserving machine learning; additionally, the distributed data management method can operate with the classical additive secret-sharing scheme to increase the range of secure operations. Cryptanalysis was carried out to verify that the secret-sharing scheme is secure against a semi-honest adversary and that the computation is secure under the universal composability framework. The distributed data management method can be combined with existing data-security solutions, such as data anonymization, encryption, and secure enclaves, to provide layered protection. The distributed data management method according to various example embodiments may be applied in various applications, such as but not limited to, privacy-preserving big data analytics and large-scale numerical computing, as well as federated machine learning and applying differential privacy to limit the privacy leakage.
[00114] While embodiments of the invention have been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

What is claimed is:
1. A method of distributed data management using at least one processor, the method comprising:
decomposing, at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition;
transmitting, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and
storing, at a memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block.
2. The method according to claim 1, wherein
the tensor network decomposition is based on singular value decomposition (SVD), and
said data is randomly decomposed into the plurality of randomized tensor blocks based on a perturbation vector configured to randomly distribute singular values associated with the SVD in relation to the plurality of randomized tensor blocks.
3. The method according to claim 2, wherein the plurality of randomized tensor blocks are each compressed based on one or more coding techniques.
4. The method according to any one of claims 1 to 3, wherein said data is an n-th order tensor and each of the plurality of randomized tensor blocks has an equal or a lower order than said data.
5. The method according to any one of claims 1 to 4, wherein said data is big and complex data.
6. The method according to any one of claims 1 to 5, wherein the plurality of distributed computing nodes are non-colluding amongst the plurality of distributed computing nodes.
7. The method according to any one of claims 1 to 6, wherein the plurality of distributed computing nodes each has data security implemented thereat, and each of the plurality of randomized tensor blocks received at the corresponding distributed computing node is subjected to the data security implemented thereat.
8. The method according to any one of claims 1 to 7, wherein
the plurality of randomized tensor blocks are randomly assigned amongst the plurality of distributed computing nodes for said transmission thereto, respectively, and
in the metadata, for each of the plurality of randomized tensor blocks, the identity information relating to the randomized tensor block is anonymized and the location information relating to the randomized tensor block corresponds to an address of the distributed computing node which the randomized tensor block is assigned to.
9. The method according to any one of claims 1 to 8, further comprising:
transmitting, at the source computing node, a storage request message to first one or more of the plurality of distributed computing nodes based on the identity information and the location information relating to each of corresponding first one or more of the plurality of randomized tensor blocks for instructing said first one or more of the plurality of distributed computing nodes to store said first one or more of the plurality of randomized tensor blocks received at corresponding one or more memories associated with said first one or more of the plurality of distributed computing nodes, respectively.
10. The method according to claim 9, wherein
the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks, and the method further comprises:
transmitting, at the source computing node, a retrieval request message to each of the plurality of distributed computing nodes based on the identity information and the location information relating to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to the source computing node;
receiving, at the source computing node, the plurality of randomized tensor blocks transmitted from the plurality of distributed computing nodes, respectively, in response to the retrieval request message; and
generating, at the source computing node, a reconstructed data corresponding to said data based on the plurality of randomized tensor blocks received and the reconstruction information in the metadata associated with the plurality of randomized tensor blocks.
11. The method according to claim 9, further comprising:
transmitting, at the source computing node, a computation request message to second one or more of the plurality of distributed computing nodes based on the identity information and the location information relating to each of corresponding second one or more of the plurality of randomized tensor blocks for instructing said second one or more of the plurality of distributed computing nodes to perform a computation on said second one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with said second one or more of the plurality of distributed computing nodes to obtain one or more computed outputs, respectively.
12. The method according to claim 9, wherein
the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks, and
the method further comprises:
transmitting, at the source computing node, a sharing request message to each of the plurality of distributed computing nodes based on the identity information and the location information relating to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to a second computing node; and
transmitting, at the source computing node, the metadata associated with the plurality of randomized tensor blocks to the second computing node.
13. The method according to any one of claims 9 to 12, further comprising:
transmitting, at the source computing node, an update request message to third one or more of the plurality of distributed computing nodes based on the identity information and the location information relating to each of corresponding third one or more of the plurality of randomized tensor blocks for instructing said third one or more of the plurality of distributed computing nodes to perform an update on said third one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with said third one or more of the plurality of distributed computing nodes to obtain a plurality of updated randomized tensor blocks, respectively.
14. A system for distributed data management comprising:
a memory; and
at least one processor communicatively coupled to the memory and configured to:
decompose, at a source computing node of the at least one processor, data randomly into a plurality of randomized tensor blocks based on tensor network decomposition;
transmit, at the source computing node, the plurality of randomized tensor blocks to a plurality of distributed computing nodes, respectively; and
store, at the memory associated with the source computing node, metadata associated with the plurality of randomized tensor blocks, the metadata comprising, for each of the plurality of randomized tensor blocks, identity information and location information relating to the randomized tensor block.
15. The system according to claim 14, wherein the tensor network decomposition is based on singular value decomposition (SVD), and
said data is randomly decomposed into the plurality of randomized tensor blocks based on a perturbation vector configured to randomly distribute singular values associated with the SVD in relation to the plurality of randomized tensor blocks.
16. The system according to claim 15, wherein
the plurality of randomized tensor blocks are each compressed based on one or more coding techniques.
17. The system according to any one of claims 14 to 16, wherein said data is an n-th order tensor and each of the plurality of randomized tensor blocks has an equal or a lower order than said data.
18. The system according to any one of claims 14 to 17, wherein said data is big and complex data.
19. The system according to any one of claims 14 to 18, wherein the plurality of distributed computing nodes are non-colluding amongst the plurality of distributed computing nodes.
20. The system according to any one of claims 14 to 19, wherein the plurality of distributed computing nodes each has data security implemented thereat, and each of the plurality of randomized tensor blocks received at the corresponding distributed computing node is subjected to the data security implemented thereat.
21. The system according to any one of claims 14 to 20, wherein the plurality of randomized tensor blocks are randomly assigned amongst the plurality of distributed computing nodes for said transmission thereto, respectively, and in the metadata, for each of the plurality of randomized tensor blocks, the identity information relating to the randomized tensor block is anonymized and the location information relating to the randomized tensor block corresponds to an address of the distributed computing node which the randomized tensor block is assigned to.
22. The system according to any one of claims 14 to 21, wherein the at least one processor is further configured to:
transmit, at the source computing node, a storage request message to first one or more of the plurality of distributed computing nodes based on the identity information and the location information relating to each of corresponding first one or more of the plurality of randomized tensor blocks for instructing said first one or more of the plurality of distributed computing nodes to store said first one or more of the plurality of randomized tensor blocks received at corresponding one or more memories associated with said first one or more of the plurality of distributed computing nodes, respectively.
23. The system according to claim 22, wherein
the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks, and
the at least one processor is further configured to:
transmit, at the source computing node, a retrieval request message to each of the plurality of distributed computing nodes based on the identity information and the location information relating to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to the source computing node;
receive, at the source computing node, the plurality of randomized tensor blocks transmitted from the plurality of distributed computing nodes, respectively, in response to the retrieval request message; and
generate, at the source computing node, a reconstructed data corresponding to said data based on the plurality of randomized tensor blocks received and the reconstruction information in the metadata associated with the plurality of randomized tensor blocks.
24. The system according to claim 22, wherein the at least one processor is further configured to:
transmit, at the source computing node, a computation request message to second one or more of the plurality of distributed computing nodes based on the identity information and the location information relating to each of corresponding second one or more of the plurality of randomized tensor blocks for instructing said second one or more of the plurality of distributed computing nodes to perform a computation on said second one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with said second one or more of the plurality of distributed computing nodes to obtain one or more computed outputs, respectively.
25. The system according to claim 22, wherein
the metadata further comprises reconstruction information relating to the plurality of randomized tensor blocks, and
the at least one processor is further configured to:
transmit, at the source computing node, a sharing request message to each of the plurality of distributed computing nodes based on the identity information and the location information relating to each of the plurality of randomized tensor blocks for instructing each of the plurality of distributed computing nodes to transmit the plurality of randomized tensor blocks stored at a plurality of memories associated with the plurality of distributed computing nodes, respectively, to a second computing node; and
transmit, at the source computing node, the metadata associated with the plurality of randomized tensor blocks to the second computing node.
26. The system according to any one of claims 22 to 25, wherein the at least one processor is further configured to:
transmit, at the source computing node, an update request message to third one or more of the plurality of distributed computing nodes based on the identity information and the location information relating to each of corresponding third one or more of the plurality of randomized tensor blocks for instructing said third one or more of the plurality of distributed computing nodes to perform an update on said third one or more of the plurality of randomized tensor blocks stored at corresponding one or more memories associated with said third one or more of the plurality of distributed computing nodes to obtain a plurality of updated randomized tensor blocks, respectively.
27. A network system for distributed data management, the network system comprising:
a plurality of distributed servers comprising a plurality of distributed computing nodes, respectively, each distributed server comprising:
a memory; and
at least one processor communicatively coupled to the memory and comprising the corresponding distributed computing node of the plurality of distributed computing nodes; and
a system for distributed data management according to any one of claims 14 to 26.
28. A computer program product, embodied in one or more non-transitory computer-readable storage mediums, comprising instructions executable by at least one processor to perform a method of distributed data management according to any one of claims 1 to 13.
PCT/SG2020/050404 2019-07-12 2020-07-13 Method and system for distributed data management WO2021010896A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201906493S 2019-07-12
SG10201906493S 2019-07-12

Publications (1)

Publication Number Publication Date
WO2021010896A1 true WO2021010896A1 (en) 2021-01-21

Family

ID=74181446

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2020/050404 WO2021010896A1 (en) 2019-07-12 2020-07-13 Method and system for distributed data management

Country Status (1)

Country Link
WO (1) WO2021010896A1 (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180011866A1 (en) * 2015-07-27 2018-01-11 Sas Institute Inc. Distributed data set encryption and decryption

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KIM M.: "TensorDB and Tensor-Relational Model (TRM) for Efficient Tensor- Relational Operations", 31 August 2014 (2014-08-31), pages 1 - 3, XP055787018, Retrieved from the Internet <URL:https://repository.asu.edu/attachments/137278/content/Kim_asu_0010E_14162.pdf> [retrieved on 20201006] *
MARUHASHI K. ET AL.: "MultiAspectForensics: Pattern Mining on Large-scale Heterogeneous Networks with Tensor Analysis", IEEE TRANSACTIONS ON SERVICES COMPUTING, 18 August 2011 (2011-08-18), pages 203 - 210, XP032037499, DOI: 10.1109/ASONAM.2011.80 *
MENG SHUNMEI, QI LIANYONG, LI QIANMU, LIN WENMIN, XU XIAOLONG, WAN SHAOHUA: "Privacy-preserving and sparsity-aware location-based prediction method for collaborative recommender systems", FUTURE GENERATION COMPUTER SYSTEMS, vol. 96, 21 February 2019 (2019-02-21), pages 324 - 335, XP055787012, DOI: 10.1016/J.FUTURE.2019.02.016 *
ONG J.-B. ET AL.: "Convolutional Neural Networks with Transformed Input based on Robust Tensor Network Decomposition", ARXIV.ORG. 1812.02622, 11 December 2018 (2018-12-11), pages 1 - 11, XP080989961 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537498A (en) * 2021-06-30 2021-10-22 电子科技大学 TrustZone-based credible quantitative model reasoning method
CN113537498B (en) * 2021-06-30 2022-07-26 电子科技大学 TrustZone-based credible quantitative model reasoning method
CN114928681A (en) * 2022-03-14 2022-08-19 中南大学 Information hiding method and system based on generation countermeasure network
CN114928681B (en) * 2022-03-14 2024-02-27 中南大学 Information hiding method and system based on generation countermeasure network
CN115630964A (en) * 2022-12-22 2023-01-20 南京邮电大学 Construction method of high-dimensional private data-oriented correlation data transaction framework
CN116070281A (en) * 2023-04-06 2023-05-05 睿至科技集团有限公司 Data storage method and system of cloud management platform
CN116070281B (en) * 2023-04-06 2023-08-01 睿至科技集团有限公司 Data storage method and system of cloud management platform
CN117077185A (en) * 2023-10-18 2023-11-17 翼方健数(北京)信息科技有限公司 Data storage and protection method, system and medium based on HMAC and secret sharing
CN117077185B (en) * 2023-10-18 2024-02-02 翼方健数(北京)信息科技有限公司 Data storage and protection method, system and medium based on HMAC and secret sharing


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20840037

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 290422)

122 Ep: pct application non-entry in european phase

Ref document number: 20840037

Country of ref document: EP

Kind code of ref document: A1