WO2022225578A1 - Real-time event-driven serverless functions within storage systems for near data processing - Google Patents
Real-time event-driven serverless functions within storage systems for near data processing
- Publication number
- WO2022225578A1 (PCT/US2021/071236)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- storage
- user
- storage system
- event
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
- G06F16/24565—Triggers; Constraints
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/1827—Management specifically adapted to NAS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
Definitions
- the present disclosure is related to system data storage and, in particular, to providing operations of data-related functions in a data storage system.
- a compute system, such as compute servers, performs computations on data and can store the results of the computation in a storage system.
- Image detection or prediction modules or procedures can be applied to evaluate risk, followed by application of de-identification modules or procedures to comply with governmental requirements such as the Health Insurance Portability and Accountability Act (HIPAA).
- the de-identified data can be sent to the researchers, and detection results can be sent to physicians.
- This large set of data is used for multiple tasks, where each task may only use a portion of the total data.
- the partial use of a large integrated amount of data is not limited to the health care industry, but is applicable to other industries that use big data.
- UDF user-defined function
- the architecture and protocol can allow users to write their own storage-side UDFs and to register different types of events to storage systems.
- a UDF controller can respond to an event notification and can automatically invoke serverless function deployment to support fully automated data pipeline.
- the serverless deployment can be storage aware via orchestration to select appropriate storage nodes to deploy serverless UDFs based on real-time resources and special hardware.
- Storage-side serverless UDF with full orchestration can be run on the storage-side in a compute system-storage system architecture with near data processing.
- a storage system comprising a memory storing instructions and one or more storage processors in communication with the memory, wherein the one or more storage processors execute the instructions.
- the one or more storage processors execute the instructions to detect a data storage event in the storage system, where the data storage event is initiated from exterior to the storage system. Metadata is extracted from the data storage event detected.
- the one or more storage processors execute instructions to automatically invoke a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
- the one or more storage processors are operable to store, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
- the one or more storage processors are operable to execute stored instructions to register the user-defined function in the storage system prior to detection of the data storage event.
- registration of the user-defined function includes registration of one or more parameters for the user-defined function including data for matching to the metadata from the data storage event detected, a trigger condition to respond to the detection of the storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
- the storage system includes a user-defined function registry that stores user-defined functions and parameters for the user-defined functions.
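The registration parameters and the registry described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class names `UdfRegistration` and `UdfRegistry`, every field name, and the exact-match rule on metadata are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class UdfRegistration:
    """One registry entry; every field name here is hypothetical."""
    name: str
    func: Callable[[bytes], bytes]
    match_metadata: Dict[str, str]   # data matched against event metadata
    trigger_events: List[str]        # trigger condition, e.g. ["upload"]
    preferences: Dict[str, str] = field(default_factory=dict)  # resource preferences
    sla: str = "best-effort"         # service-level agreement label


class UdfRegistry:
    """In-memory stand-in for a storage-side UDF registry."""

    def __init__(self) -> None:
        self._entries: Dict[str, UdfRegistration] = {}

    def register(self, reg: UdfRegistration) -> None:
        # Registration happens before any event is detected.
        if reg.name in self._entries:
            raise ValueError(f"UDF already registered: {reg.name}")
        self._entries[reg.name] = reg

    def find(self, event_type: str, metadata: Dict[str, str]) -> List[UdfRegistration]:
        # A UDF matches when the event type is one of its triggers and every
        # registered key/value pair appears in the event metadata.
        return [
            r for r in self._entries.values()
            if event_type in r.trigger_events
            and all(metadata.get(k) == v for k, v in r.match_metadata.items())
        ]


registry = UdfRegistry()
registry.register(
    UdfRegistration("deidentify", lambda b: b, {"content_type": "image/dicom"}, ["upload"])
)
matches = registry.find("upload", {"content_type": "image/dicom", "bucket": "scans"})
```

Here `matches` holds the single `deidentify` entry, while a `delete` event with the same metadata would match nothing.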
- the one or more storage processors are operable to provide security measures, specific to the user-defined functions, to storage of the user-defined functions in the user-defined function registry or to retrieval of the user-defined functions from the user-defined function registry.
- the one or more storage processors are operable to execute the instructions to process and analyze an event notification, one or more user-defined functions, or one or more target storage objects, based on the detection of the data storage event in the storage system.
- the one or more storage processors are operable to execute the instructions to orchestrate scheduling operation of the user-defined function and select one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
- the one or more storage processors are operable to execute the instructions to generate a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
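One way to picture node scoring is a weighted sum over a few of the inputs named above. The sketch below is an assumption-laden illustration: the `NodeState` fields, the weights, and the hard GPU eligibility filter are all hypothetical, and the disclosure does not prescribe a particular formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    """Hypothetical snapshot of one storage node."""
    name: str
    free_cpu: float      # runtime storage system information, fraction in [0, 1]
    has_gpu: bool        # node configuration (special hardware)
    holds_object: bool   # runtime event information (data locality)


def node_score(node: NodeState, w_locality: float = 0.5, w_cpu: float = 0.5) -> float:
    # Higher is better: prefer nodes that already hold the target object
    # and that have spare CPU capacity.
    locality = 1.0 if node.holds_object else 0.0
    return w_locality * locality + w_cpu * node.free_cpu


def select_node(nodes: List[NodeState], needs_gpu: bool = False) -> NodeState:
    # UDF metadata (here only `needs_gpu`) filters out ineligible nodes first.
    eligible = [n for n in nodes if n.has_gpu or not needs_gpu]
    if not eligible:
        raise RuntimeError("no storage node satisfies the UDF requirements")
    return max(eligible, key=node_score)


nodes = [
    NodeState("node-a", free_cpu=0.9, has_gpu=False, holds_object=False),
    NodeState("node-b", free_cpu=0.3, has_gpu=False, holds_object=True),
    NodeState("node-c", free_cpu=0.1, has_gpu=True, holds_object=False),
]
chosen = select_node(nodes)
chosen_gpu = select_node(nodes, needs_gpu=True)
```

With these sample values the locality term outweighs the idle CPU on `node-a`, so `node-b` is chosen; when the UDF needs a GPU, only `node-c` is eligible.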
- the storage system includes multiple nodes with each node having a storage processor operable to detect a specific data storage event for the node and execute a specific user-defined function in the node in response to detection of the specific data storage event in the node.
- the instructions to automatically invoke a user-defined function directly within the storage system in response to detection of a data storage event in the storage system are portable to different types of storage systems using different storage protocols and capable of operation with one or more protocols that perform storage operations in the storage system.
- a computer-implemented method of storage-side computation comprises detecting a data storage event in a storage system, where the data storage event is initiated from exterior to the storage system and extracting metadata from the data storage event detected.
- the computer-implemented method comprises, in response to detecting the data storage event in the storage system and after completing the data storage event, automatically invoking a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
- the computer-implemented method includes storing, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
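The method steps above (detect an externally initiated event, extract metadata, invoke a resident UDF after the event completes, and keep the result inside the storage system) can be sketched as a toy in-memory storage node. The `StorageNode` class, the suffix-based matching, and the `.result` key convention are hypothetical; invocation is synchronous here only for brevity, and a real system would more likely dispatch it asynchronously.

```python
from typing import Callable, Dict


class StorageNode:
    """Toy storage node; all names and conventions here are illustrative."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.udfs: Dict[str, Callable[[bytes], bytes]] = {}  # keyed by key suffix

    def register_udf(self, suffix: str, func: Callable[[bytes], bytes]) -> None:
        self.udfs[suffix] = func

    def put(self, key: str, data: bytes) -> str:
        """Storage operation initiated from outside the storage system."""
        self.objects[key] = data                      # complete the storage event first
        event = {"type": "upload", "key": key, "size": len(data)}  # extracted metadata
        self._on_event(event)
        return "stored"                               # the client gets only an acknowledgement

    def _on_event(self, event: Dict[str, object]) -> None:
        key = str(event["key"])
        for suffix, func in self.udfs.items():
            if key.endswith(suffix):
                # The result stays in the storage system; it is not returned
                # to the client or to a compute system.
                self.objects[key + ".result"] = func(self.objects[key])


node = StorageNode()
node.register_udf(".txt", bytes.upper)
ack = node.put("note.txt", b"hello")
```

After the call, `node.objects` contains both `note.txt` and `note.txt.result`, and the caller has received only the acknowledgement string.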
- the computer-implemented method includes registering, in the storage system, the user-defined function prior to detecting the data storage event.
- registering the user-defined function includes registering, for the user-defined function, data for matching to one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
- the computer-implemented method includes storing user-defined functions and parameters for the user-defined functions in a user-defined function registry storage in the storage system.
- the computer-implemented method includes providing security measures, specific to the user-defined functions, in storing the user-defined functions in the user-defined function registry or in retrieving the user-defined functions from the user-defined function registry.
- the computer-implemented method includes, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more user-defined functions, or one or more target storage objects.
- the computer-implemented method includes scheduling operation of the user-defined function and selecting one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
- the computer-implemented method includes generating a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
- the computer-implemented method includes performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system in one node of multiple storage nodes of the storage system.
- performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system includes using a storage-side protocol of automatic invoking of the user-defined function directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
- a non-transitory computer-readable medium storing instructions for storage-side computation, which, when executed by one or more storage processors, cause the one or more processors to perform operations.
- the operations comprise detecting a data storage event in a storage system, the data storage event initiated from exterior to the storage system; extracting metadata from the data storage event detected; and in response to detecting the data storage event in the storage system and after completing the data storage event, automatically invoking a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
- the operations include storing, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
- the operations include registering, in the storage system, the user-defined function prior to detecting the data storage event.
- the operations include registering the user-defined function including registering, for the user-defined function, one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
- the operations include storing user-defined functions and parameters for the user-defined functions in a user-defined function registry storage in the storage system.
- the operations include providing security measures, specific to the user-defined functions, in storing the user-defined functions in the user-defined function registry or in retrieving the user-defined functions from the user-defined function registry.
- the operations include, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more user-defined functions, or one or more target storage objects.
- the operations include scheduling operation of the user-defined function and selecting one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
- the operations include generating a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
- the operations include performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system in one node of multiple storage nodes of the storage system.
- performing the detecting of the storage event and the automatic invoking of the user-defined function directly within the storage system includes using a storage-side protocol of automatic invoking of the user-defined function directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
- Figure 1 is an illustration of an example architecture for a storage-side user-defined function for which a storage system interfaces with a compute system, according to various embodiments.
- Figure 2 illustrates features of an example event-driven user-defined function invocation sequence flow in the architecture of Figure 1, according to various embodiments.
- Figure 3 illustrates an example automatic data pipeline for providing access to health data by different users using a storage-side user-defined function system similar to the architecture of Figure 2, according to various embodiments.
- Figure 4 shows an example of a sequence of user-defined function deployment with respect to node selection, according to various embodiments.
- Figure 5 illustrates an example of factors that influence a decision in a storage-side serverless scheduler to select a storage node for user-defined function execution, according to various embodiments.
- Figure 6 is a flow diagram of features of an example method of storage-side computation, according to various embodiments.
- Figure 7 is a block diagram illustrating components of an example system that can implement algorithms and perform methods structured for real-time event-driven serverless functions within storage systems, according to various embodiments.
- the functions or algorithms described herein may be implemented in software in an embodiment.
- the software may comprise computer-executable instructions stored on computer-readable media or a computer-readable storage device, such as one or more non-transitory memories or other types of hardware-based storage devices, either local or networked.
- such functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples.
- the software may be executed on a digital signal processor, application-specific integrated circuit (ASIC), a microprocessor, or other type of processor operating on a computer system, such as a personal computer, server, or other computer system, turning such computer system into a specifically programmed machine.
- ASIC application-specific integrated circuit
- Computer-readable non-transitory media include all types of computer-readable media, including magnetic storage media, optical storage media, and solid state storage media, and specifically exclude signals.
- the software can be installed in and sold with the devices that implement arrangements of compute clusters and storage clusters for artificial intelligence training or other data intense operations as taught herein.
- the software can be obtained and loaded into such devices, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator.
- the software can be stored on a server for distribution over the Internet, for example.
- the data is downloaded from the storage system to the servers for the different tasks, which can result in data duplication when there is an overlap of data associated with the different tasks.
- WAN wide area network
- Other issues can include occurrence of a security risk due to exposure of sensitive data.
- Manual processes in a conventional procedure, for example in big data processing such as health care operations, can be error prone and have high labor costs. Replication of generic functions in many applications can result in client-side infrastructure cost.
- an architecture and protocol can be implemented to allow a user, via a user device, to define UDFs in a storage side of the architecture, and to provide a fully automated real-time event-driven data pipeline. This can provide efficient data processing, extra security, lower cost, and worry-free UDF invocation.
- the storage side architecture can be implemented in communication with a compute side architecture.
- the storage side of the architecture can include one or more storage systems.
- a storage system is a system primarily dedicated to storing data and performing operations to create, read, update, and delete data along with other data-oriented operations and associated storage services that can replicate data, make extra copies of data, and take snapshots of the data.
- each storage system can have a set of storage nodes, where a storage node can operate as a storage system.
- a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
- the elements of a pipeline can be executed in parallel or in a time-sliced fashion.
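The serial form of a pipeline, where each element's output feeds the next element, can be sketched in a few lines of Python. The stage names `detect` and `de_identify`, the record fields, and the 0.8 threshold are hypothetical placeholders loosely modeled on the health-data example; parallel or time-sliced execution is not shown.

```python
from functools import reduce
from typing import Any, Callable, Iterable


def run_pipeline(stages: Iterable[Callable[[Any], Any]], data: Any) -> Any:
    # Feed the output of each stage into the next one, in order.
    return reduce(lambda value, stage: stage(value), stages, data)


def detect(record: dict) -> dict:
    # Placeholder "detection" stage: flag records whose score is high.
    return {**record, "flagged": record["score"] > 0.8}


def de_identify(record: dict) -> dict:
    # Placeholder "de-identification" stage: drop the identifying field.
    return {k: v for k, v in record.items() if k != "name"}


result = run_pipeline([detect, de_identify], {"name": "Jane Doe", "score": 0.9})
```

The final `result` carries the detection flag but no longer carries the `name` field.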
- UDFs are functions provided by a user that can be built into a program or environment, and run in the program or environment.
- a UDF is most commonly used in the analytics field or the database field, where the UDF can extend the native functionality of the query engine for the analytics field or the database field.
- UDFs are commonly used in the fields of big data analytics (BDA) and artificial intelligence (AI), where the native functionality of the compute platforms is not enough to cover the complex logic in BDA and AI computation.
- BDA big data analytics
- AI artificial intelligence
- UDFs can be characterized by poor performance. For example, a small number of UDFs in a conventional analytics engine for big data processing can take up to 70% of the central processing unit (CPU) time, for a number of reasons.
- UDFs often run as a black-box on the compute side and are difficult to implement in an optimized manner in a query engine such as a structured query language (SQL) engine. UDFs often cannot be pushed down to the data source, so a huge amount of data is to be shipped from the storage system to the compute system, where the compute system cannot perform near data processing (NDP).
- SQL structured query language
- UDFs are currently used mostly in the context of a SQL query, and the scope of a UDF within a query is typically too narrow.
- an architecture can be implemented in which storage-side UDFs can be pushed down to a storage layer and operate on storage objects in-place directly in the storage side.
- the storage-side UDFs defined by a user are NDP procedures whose computations in the storage side can be more efficient than enterprise storage system processing and cloud storage backend processing.
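The benefit of pushing a UDF down to the storage layer can be illustrated by counting how many rows must leave the storage system in each case. This is a toy model under stated assumptions: the function names and the row-count proxy for network traffic are hypothetical, and no performance figures from the disclosure are implied.

```python
from typing import Callable, List, Tuple

Row = dict


def compute_side(rows: List[Row], udf: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    # Conventional path: every row is shipped to the compute system,
    # and the UDF filters there.
    shipped = list(rows)
    return [r for r in shipped if udf(r)], len(shipped)


def storage_side(rows: List[Row], udf: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    # Pushed-down path: the UDF runs next to the data, so only matching
    # rows (if any are requested at all) would need to leave storage.
    selected = [r for r in rows if udf(r)]
    return selected, len(selected)


rows = [{"id": i, "abnormal": i % 10 == 0} for i in range(100)]
is_abnormal = lambda r: r["abnormal"]

result_a, shipped_a = compute_side(rows, is_abnormal)
result_b, shipped_b = storage_side(rows, is_abnormal)
```

Both paths produce the same ten matching rows, but the compute-side path moves all 100 rows while the storage-side path moves at most ten.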
- the storage-side UDFs can be provided for serverless computing, which is a method of providing backend services on an as-used basis. Servers are still used in serverless computing, but serverless computing is charged based on usage, not on a fixed amount of bandwidth or number of servers. Current conventional serverless computing exists only on the compute side of a compute-side/storage-side architecture.
- the storage-side UDFs can be implemented as orchestrated serverless functions on a storage system.
- the architecture of storage-side UDFs can be integrated into storage systems or cloud storage backend having a number of characteristics.
- the storage systems for integration can be generic storage systems, in which UDFs on any format of (big data) object types, such as comma-separated values (CSV), JavaScript Object Notation (JSON), columnar, images, streaming, and other object types, are supported.
- the UDFs in the storage-side architecture can use standard storage protocols.
- the storage systems can run in their own private storage network, which may or may not have an internet connection for security and performance reasons, so some common practices for compute-side serverless operation are modified.
- the storage-side UDFs can be implemented to run directly on storage controllers.
- the storage controllers can include one or more storage processors and associated memory having stored instructions to manage and run the UDFs. These storage controllers can be assigned many other tasks, especially main tasks in storage operations, and can support multi-level service-level agreements (SLAs).
- an SLA defines the level of service expected by a customer from a supplier, laying out the metrics by which that service is measured.
- the SLA can include agreed upon remedies or penalties for service levels not achieved.
- These storage controllers can control different sets of storage objects, including orchestration of data locality, and can have different resources and hardware composition with orchestration of resource and hardware affinity.
- Storage systems normally have significant internal metadata information from caching, partitioning, cataloging/indexing, data protection, etc., which is only available to the storage systems.
- Storage-side UDFs and their associated services run on storage systems directly and can use normally generated information for a node selection algorithm to improve performance of the storage-side UDFs in storage systems having multiple nodes.
- a storage-side control module, comprising instructions executable by storage processors, can use information that the storage system, with which the storage-side control module is correlated, obtains from internally breaking down objects or obtains from different partitions or shardings.
- An example of internally breaking down objects can include error correction (EC). Partitioning deals with grouping subsets of data within a single database.
- Sharding deals with optimizing database management systems by separating rows or columns of a database table into multiple smaller tables.
- Sharding and partitioning both relate to dispersing a large data set into smaller subsets.
- sharding implies that the data is spread across multiple computers, while partitioning typically deals with a single database instance.
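The distinction between sharding and partitioning can be sketched as two routing functions. Both are hypothetical illustrations: hash-based shard selection and per-year partition names are common conventions chosen here for the example, not mechanisms stated in the disclosure.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick which machine (shard) holds a key, using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition_for(year: int) -> str:
    """Pick a partition inside a single database instance, here by year."""
    return f"records_{year}"


shard = shard_for("patient-0042/scan-7", num_shards=4)
partition = partition_for(2021)
```

`shard_for` spreads keys across several machines and always returns the same shard for the same key, whereas `partition_for` only groups data within one database instance.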
- the storage-side control module can behave like any storage client but without running UDFs on the compute-side. Better performance can be achieved by the storage-side control module controlling the running of the UDFs on the storage-side, where data is closer and safer to access from a security standpoint.
- FIG. 1 is an illustration of an embodiment of an example architecture 100 for a storage-side UDF for which a storage system 105 interfaces with a compute system 102.
- the compute system 102 can include a number of instrumentalities that execute various computation operations.
- the compute system 102 can include a command line interface (CLI) or a graphical user interface (GUI) 103, a compute module 104, a custom computation module 106, or other structures that provide input to a software development kit (SDK) 107.
- An SDK functions to provide a set of tools, libraries, relevant documentation, code samples, processes, and/or guides that allow developers to create software applications on a specific platform.
- the compute system of the compute-storage architecture performs computations and stores data in the storage system of the compute-storage architecture.
- the data is transferred from the storage system of the conventional compute-storage architecture back to the compute system of the conventional compute-storage architecture for computation, where results are stored in another location in the storage system of the conventional compute-storage architecture.
- the storage system 105 can include a number of storage servers 110-1, 110-2, . . . 110-N.
- Each of the storage servers 110-1, 110-2, . . . 110-N can be structured with similar or identical types of instrumentality arranged in a similar or identical manner.
- the native server 112 can include one or more storage processors and stored instructions to control operations on data to store, delete, access, copy, and perform other conventional operations related to storage of data.
- the native server 112 can operate in conjunction with compute system 102 with respect to the conventional operations related to storage of data from the compute system 102.
- the SDK 107 of the compute system 102 can operationally communicate with the native server 112 to perform conventional data operations.
- the native server 112 can include other event target cluster system 126 that receives storage event information from the native server 112, and can send notification to the event listener service 119.
- Each of the storage servers 110-1, 110-2, . . . 110-N can include an event service 113 to monitor data storage activity of the storage servers with respect to the physical volumes of storage and to interface with a storage-side UDF controller system.
- the event service 113 can identify occurrence of a data storage event.
- a data storage event is an operation in a storage system related directly to data. Examples of a data storage event can include uploading data, downloading data, copying data, deleting data, getting data, listing data, opening a file, and similar data oriented operations.
- a data storage event does not include error-based operations dealing with data storage.
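The event categories listed above can be captured in a small enumeration that an event service might use to decide whether an observed operation counts as a data storage event. The enum name, its string values, and the helper function are hypothetical; error-handling operations are simply absent from the set.

```python
from enum import Enum


class DataStorageEvent(Enum):
    """Data-oriented operations treated as data storage events (illustrative)."""
    UPLOAD = "upload"
    DOWNLOAD = "download"
    COPY = "copy"
    DELETE = "delete"
    GET = "get"
    LIST = "list"
    OPEN = "open"


_EVENT_VALUES = {e.value for e in DataStorageEvent}


def is_data_storage_event(operation: str) -> bool:
    # Error-based operations (for example a checksum failure) are not
    # members of the enum and therefore are not treated as events.
    return operation in _EVENT_VALUES


observed = ["upload", "checksum_error", "list"]
events = [op for op in observed if is_data_storage_event(op)]
```

Only `upload` and `list` survive the filter; the error-based operation is ignored.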
- the event service 113 can monitor command and address lines to the physical volumes of storage to determine action taken on data.
- the event service 113 of storage server 110-1 can interface with a storage-side UDF controller system that is implemented as an NDP server 115-1.
- the NDP server 115-1 can be integrated within storage server 110-1 as a definable set of instructions to operate with respect to one or more storage-side UDFs using one or more storage processors of storage server 110-1.
- NDP server 115-1 can be integrated within storage server 110-1 as designated hardware including one or more storage processors to execute instructions of the NDP server 115-1 to operate with respect to one or more storage-side UDFs.
- the NDP server 115-1 can be implemented as a standalone structure coupled to storage server 110-1, where such a standalone structure includes a set of instructions to operate with respect to one or more storage-side UDFs using one or more storage processors of the NDP server 115-1.
- Each of the storage servers 110-1, 110-2, . . . 110-N can be arranged with respect to each of NDP servers 115-1, 115-2, . . . 115-N, respectively, in a manner similar to the NDP server 115-1 arranged with the storage server 110-1.
- Each of the NDP servers 115-1, 115-2, . . . 115-N can be structured in a similar or identical manner.
- the NDP server 115-1 can include an NDP service 114, a UDF service 116, a function-as-a-service (FaaS) service 117, a UDF registry service 118, and an event listener service 119.
- the NDP service 114 can include, but is not limited to, a storage client base service 135 to control and manage one or many different storage services that behave as storage clients that communicate with underlying storage servers in the storage systems.
- Typical services can be an object storage service 131, a cloud storage service 132, a file storage service 133, and other storage services 134 such as block storage services, special HDFS (Hadoop Distributed File System) service etc. These services can be implemented as a scalable, high speed, cloud native storage service.
- the NDP Service 114 can be implemented as a storage-side representational state transfer (REST) service that can accept and process common storage requests by operating in compliance with standard protocols.
- REST typically defines a set of constraints for the manner in which an architecture, such as an Internet-scale distributed hypermedia system, behaves.
- REST can provide for scalability of interactions between components, uniform interfaces, independent deployment of components, and a layered architecture to facilitate caching components to reduce user-perceived latency, enforce security, and encapsulate older systems.
- the storage-side NDP Service 114 can handle more than storage requests. It can include a protocol to process a UDF request as part of storage requests for direct invocation of UDFs. The direct invocation can be automatic initiation of the UDFs.
- the NDP Service 114 can be implemented with a storage system plugin design to allow UDF support to be highly portable in any storage system and cloud storage backend such that the NDP Service 114 can support more than a specific storage system.
- the NDP Service 114 can operate in conjunction with the compute system 102, a client 101 separate from the storage system 105 and separate from compute system 102, and other components of the NDP server 115-1.
- the UDF service 116 can manage and execute UDF operations on storage system 105 without uploading data to compute system 102 for computations.
- the UDF Service 116 can be implemented as a storage-side REST service to validate and invoke UDFs with the option of using serverless or standalone containers.
- the UDF Service 116 can be initiated from input from the event listener service 119. Results of the invocation of the UDF Service 116 can be provided back to the NDP Service 114.
- a UDF function in the storage system 105 of the architecture 100 can be a complete serverless UDF that compiles, publishes, and deploys the UDF as a serverless function that combines user-defined function and common boilerplate code.
- the UDF function can read from and write to storage directly via a storage client within the storage system 105.
- the UDF service 116 can operate in conjunction with the FaaS service 117 and can send storage locality and resource information to the FaaS service 117 to optimize running of functions.
- the FaaS service 117 can be implemented as a serverless client and a service to manage serverless functions. FaaS can be implemented as a serverless platform to allow users to execute code on storage system 105 as a network edge. With FaaS, users such as developers can build a modular architecture, creating a base for code that is more scalable without having to implement resources or maintain an underlying backend.
- the UDF registry service 118 can be implemented as a storage-side REST service that provides REST application programming interfaces (APIs) to manage UDF registration in the storage-side.
- An API is a shared boundary for which two or more separate components of a system exchange information, where the information defines interactions between the two or more separate components.
- An API can include multiple software applications or mixed hardware-software intermediaries.
- the UDF registry service 118 can be implemented based on an event target cluster 120 and underlying storage of the event target cluster 120.
- the UDF registry service 118 can be implemented in conjunction with a UDF hosted hub repository service provided for finding and sharing container images.
- Such a UDF repository can be implemented as a UDF registry 122 of the event target cluster 120.
- the UDF registry service 118 provides a mechanism to associate a UDF with a storage-side event and can operate in conjunction with the event listener service 119.
- the event listener service 119 can identify an occurrence of a data storage event in conjunction with the event service 113 of the storage server 110-1, which in conjunction with operation of the UDF registry service 118 can initiate operation of the associated UDF by the UDF service.
- the event listener service 119 can be implemented as a storage-side REST service that listens to registered streaming sources.
- the event listener service 119 can react to a storage-side event and automatically invoke UDFs based on the storage information obtained upon certain storage actions and events.
- the event listener service 119 can extract metadata from monitoring a data storage event.
- Extracting metadata can be implemented by reading the data stream or command structure of the data storage event.
- the event listener service 119 can determine if the metadata of the data storage event matches parameters for invoking and running a storage-side UDF in the storage system. If the metadata of the data storage event does not match the parameters, invocation of the storage-side UDF is not initiated. In contrast, conventional invoking of a compute-side UDF is performed manually without storage awareness.
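The matching step described above can be illustrated with a minimal sketch. It assumes the event metadata has already been extracted into a small record; the names `TriggerParams`, `extract_metadata`, and `matches`, and the event field names `operation`, `bucket`, and `key`, are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class TriggerParams:
    """Hypothetical registered trigger parameters for one storage-side UDF."""
    udf_name: str
    event_types: set               # e.g. {"put", "copy"}
    bucket: str
    file_name_pattern: str = "*"   # glob pattern, e.g. "*.png"


def extract_metadata(event: dict) -> dict:
    """Pull the fields used for matching out of a raw event record."""
    return {
        "event_type": event.get("operation", "").lower(),
        "bucket": event.get("bucket", ""),
        "file_name": event.get("key", ""),
    }


def matches(meta: dict, params: TriggerParams) -> bool:
    """Return True only if every registered parameter matches the metadata."""
    return (
        meta["event_type"] in params.event_types
        and meta["bucket"] == params.bucket
        and fnmatchcase(meta["file_name"], params.file_name_pattern)
    )
```

When `matches` returns False, no invocation is initiated, which mirrors the non-matching case described above.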
- the NDP server 115-1 can operate with an event target cluster 120 to control and manage storage-side UDFs.
- the event target cluster 120 can operate as a system for queueing tasks of the NDP server 115-1.
- the event target cluster 120 can be implemented as an in-memory data structure store, which can be used as a database, a cache, and a message broker.
- the event target cluster 120 can include the UDF registry 122, which provides a repository for UDFs, and an event queue 124.
- the UDF registry 122 can interface with the FaaS service 117 of the NDP server 115-1 to schedule UDFs for operation based on the event queue 124.
- the event queue 124 can schedule operations associated with identified data events determined in communication with the event listener service 119 of the NDP server 115-1.
- Similar to the arrangement and functionality of storage server 110-1, each of the storage servers 110-1, 110-2, . . . 110-N can be structured with similar or identical types of instrumentality arranged in a similar or identical manner and can operate in a similar or identical manner.
- the architecture 100 for a storage-side UDF can be implemented as a software architecture, using one or more processors, that provides an automatic event-driven storage-side serverless UDF for data pipeline.
- the architecture 100 and associated protocols allow users to write their own storage functions and to register different types of events to storage systems based on storage operations, time, and alerts.
- the user can define the parameters for the triggering events such as the storage operations, time, and alerts.
- the architecture 100 and associated operating parameters include event notification and auto serverless function deployment to support a fully automated data pipeline.
- the serverless deployment can be storage aware via orchestration to perform a selection process directed to identification of optimal storage nodes of the storage system to deploy the serverless UDFs based on real-time resource and hardware availability, such that a user can avoid running the UDF on an information technology (IT) infrastructure that would lead to complex infrastructure management.
- the architecture 100 for a storage-side UDF implemented as a software architecture can provide a highly portable storage-side UDF framework, which can be a storage-system plugin architecture and associated protocols that can be easily integrated and deployed into any storage system and cloud storage backend. This software architecture can allow for operating the UDF in systems that previously could not support the operation.
- Example architecture 100 can be implemented with one or more compute clusters and/or one or more storage clusters.
- a compute cluster can include multiple compute systems, where the compute systems can be structured similar or identical to compute system 102.
- a storage cluster can include multiple storage systems, where the storage systems can be structured similar or identical to storage system 105. In a data center, these compute systems and storage systems can be connected via a network, and can work together to execute AI operations, analytics operations, or other complex operations.
- Figure 2 illustrates features of an embodiment of an example event-driven UDF invocation sequence flow in the architecture 100 of Figure 1.
- Operations of the sequence flow can be performed by one or more storage processors executing stored instructions for real-time event-driven serverless functions within the storage system 105.
- the sequence flow can be performed in a storage system having multiple storage servers operating as multiple storage nodes. For ease of discussion, the sequence flow is described with reference to NDP server 115-1 and storage server 110-1, though a number of storage-side NDPs and storage servers can be used.
- a registration event is conducted.
- the client 101 registers notification events to the storage system 105 regarding data of interest categorized according to storage groups, which storage groups can be arranged in storage instruments of the storage system 105 as labelled buckets or folders.
- the client 101 can be a user device directly controlled by a user or can be a system such as but not limited to an Internet of Things (IoT) device.
- IoT is a network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, and actuators, which are enabled to connect and exchange data allowing for direct integration of devices in the physical world into computer-based systems.
- the storage system 105 sets up the event target cluster 120 to listen for the registered events.
- the client 101 also registers one or more UDFs into the NDP server 115-1.
- the storage-side UDF protocol can indicate the trigger condition with parameters that can be set via client 101.
- the parameters can be data to be matched with metadata in a data storage event to initiate a notification procedure to invoke running a UDF in the storage system after completion of the data storage event.
- Such registered parameters can include file type, file name, data type, or other parameter to identify a UDF to be invoked.
- a UDF registration process can be undertaken for each client of the storage system 105.
- the UDF registration process can be conducted for each different set of data associated with the client 101.
- the event listener service 119 sets up a notification task with the event target cluster 120 to listen for the registered event notifications.
- the event listener service 119 includes logic to identify storage action types and storage object information.
- client 101 can upload data to the storage system 105.
- the data can be but is not limited to image data.
- the client 101 uploads files to the registered bucket and performs a storage operation such as upload, download, copy, delete, get, or list.
- One or more of these operations can be an event to trigger one or more notifications.
- a notification can be based on occurrence of multiple storage operations.
- the protocol for the storage-side UDF system can have other trigger events.
- the storage system 105 finishes the storage operations requested by the client 101 in storage server 110-1 and event service 113 of storage server 110-1 sends notification of event to the event target cluster 120.
- the notification can send identification of the event in addition to notification of the occurrence of a data storage event.
- the notification can identify the data bucket and other characteristics of the data storage event. This generation of the notification is an automatic call.
- the event target cluster 120 calls back to the event listener service 119 regarding the event notification.
- the event target cluster 120 can also be on the storage side of architecture 100 to improve speed of the notification.
- the event listener service 119 communicates with the UDF registry service 118 to find registered UDFs with matched event type and buckets.
- the UDF registry service 118 can check with the event target cluster 120 to determine registered UDFs with matched event type and buckets to the data operation that triggered the notification.
- the event target cluster 120 can access UDF registry 122 of UDFs and the event queue 124 within the event target cluster 120.
- the event listener service 119 calls the UDF registry service 118 for matched UDFs.
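The lookup of registered UDFs by event type and bucket might be sketched as follows. This is a toy in-memory stand-in, not the registry of the described system; the class name `UdfRegistry` and its methods are assumptions introduced for illustration.

```python
from collections import defaultdict


class UdfRegistry:
    """Toy in-memory stand-in for a storage-side UDF registry (hypothetical)."""

    def __init__(self):
        # (event_type, bucket) -> list of UDF names, in registration order
        self._by_trigger = defaultdict(list)

    def register(self, udf_name, event_types, bucket):
        """Associate a UDF with each (event type, bucket) pair it should react to."""
        for event_type in event_types:
            names = self._by_trigger[(event_type, bucket)]
            if udf_name not in names:
                names.append(udf_name)

    def find_matched(self, event_type, bucket):
        """Return the UDFs registered for this event type and bucket."""
        # .get avoids creating empty entries for unknown triggers
        return list(self._by_trigger.get((event_type, bucket), []))
```

Keying the table by the pair of event type and bucket keeps the lookup performed on each notification a single dictionary access.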
- the storage-side UDF system realized by the NDP server 115-1 can support storage-side registry and private registry to improve response time and security.
- the event listener service 119 calls the UDF service 116 for UDF invocations with storage bucket, objects, and UDF information.
- the UDF service 116 calls storage system 105 to get storage object location and resource allocation, and then sends the storage information to the serverless framework of the FaaS service 117 for it to deploy serverless function into the optimal storage server (node) of the storage servers 110-1 . . . 110-N.
- the serverless UDF function and storage information can be deployed by the FaaS service 117 as a UDF container.
- the serverless UDF function applies the UDF function computation to the storage object, which generates a result storage object.
- the result storage object can be stored directly in the storage system 105 without control by the compute system 102.
- the UDF service 116 calls the NDP Service 114 to upload the result object.
- the NDP service 114 can support many storage plugins, which allows the storage-side UDF system to be implemented easily with many storage system types.
- the NDP service 114 calls storage system-specific SDK on the storage-side to upload the result object to a result bucket registered in storage system 105.
- the NDP service 114 can also upload the result object to the compute system 102, depending on the UDF function, and the result object may be sent to a display coupled to the compute system 102 or remotely over a network.
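The sequence flow of Figure 2 can be condensed into a short sketch: a queued notification is dequeued, matched UDFs are looked up, each UDF runs on the stored object, and the result is written to a result bucket without leaving storage. Buckets are modeled as plain dictionaries, and the function name `run_flow` and the event field names are assumptions for illustration only.

```python
from collections import deque


def run_flow(event_queue, registry, udfs, storage, result_bucket):
    """Drain queued event notifications and run every matched UDF.

    registry: dict mapping (event_type, bucket) -> list of UDF names
    udfs:     dict mapping UDF name -> callable(bytes) -> bytes
    storage:  dict mapping bucket -> {object key -> bytes}
    """
    invoked = []
    while event_queue:
        event = event_queue.popleft()                       # notification callback
        trigger = (event["event_type"], event["bucket"])
        for name in registry.get(trigger, []):              # registry lookup
            data = storage[event["bucket"]][event["key"]]   # read near the data
            result = udfs[name](data)                       # UDF invocation
            out_key = f"{name}/{event['key']}"
            storage.setdefault(result_bucket, {})[out_key] = result  # result upload
            invoked.append(name)
    return invoked
```

A caller would pass a `deque` of event records as `event_queue`. Node selection and container deployment are omitted here.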
- the example event-driven UDF invocation sequence flow of Figure 2 can be applied to automatically generate a thumbnail in response to storage of an image.
- upon upload of a large image into an event-registered object bucket, the storage system, using its integrated storage-side UDF system, can auto-generate a thumbnail of the large image via a registered UDF within the storage system.
- the thumbnail, as part of the storage-side UDF procedure, can be stored to another object bucket of the storage system.
- User access to the thumbnail via a request by a compute system, typically at a later time, can be accomplished by accessing the object bucket.
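As an illustration of what a thumbnail UDF body could look like, the following sketch downsamples an image held as a row-major grid of pixel values. It uses nearest-neighbour sampling purely as a stand-in; the source does not state which resizing method is used, and the function name `make_thumbnail` is hypothetical.

```python
def make_thumbnail(pixels, max_width, max_height):
    """Downscale a row-major pixel grid by nearest-neighbour sampling.

    Keeps the aspect ratio and never upscales. `pixels` is a non-empty list
    of equal-length rows; each entry can be any pixel value.
    """
    src_h, src_w = len(pixels), len(pixels[0])
    scale = min(max_width / src_w, max_height / src_h, 1.0)
    dst_w = max(1, int(src_w * scale))
    dst_h = max(1, int(src_h * scale))
    return [
        [pixels[y * src_h // dst_h][x * src_w // dst_w] for x in range(dst_w)]
        for y in range(dst_h)
    ]
```

In a deployed UDF the grid would come from decoding the uploaded object, and the returned grid would be encoded and written to the thumbnail bucket.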
- Figure 3 illustrates an embodiment of an example automatic data pipeline for providing access to health data by different users using a storage-side UDF system similar to the architecture of Figure 2.
- the use of a storage-side serverless UDF can provide for real-time, event-driven, scalable, automated, and near data processing, which can provide a data pipeline.
- a patient’s X-ray provides raw data that can be used by both a clinic and a researcher, though each may only use a portion of the raw data.
- a new X-ray of a patient is uploaded into a raw X-ray data bucket 311-1 of a storage instrument 310 in a storage system 302.
- a bucket notification service assigned to raw X-ray data bucket 311-1 fires an event notification to an event target cluster 320.
- the event notification can be a notification of object creation.
- identification of the new event is received in a storage-side UDF controller 330.
- the storage-side UDF controller 330 can be implemented as a NDP server.
- the identification of the new event can be received in an event listener service 319 of the storage-side UDF controller 330 from the event target cluster 320.
- the event listener service 319 obtains UDFs that match the event by looking up the UDFs that are registered to the detected new event.
- the event listener service 319 can invoke the matched UDFs via a UDF service 316.
- the UDF service 316 calls a serverless platform in which a deployment coordinator 345 automatically deploys a serverless function container 350, using UDF registry 318.
- the serverless function container 350 can be generated as a pod.
- a pod is a smallest deployable computing unit in a container scheduling and orchestration environment. It is a collection of one or more containers that operate together, where the containers within a pod share common networking and storage resources from a host node, as well as specifications that determine how the containers run.
- the automatic deployment can be generated after automatic execution of a node selection algorithm to determine the storage node of the storage system 302 on which to execute the matched UDFs.
- the serverless function container 350 retrieves the image loaded into the raw X-ray data bucket 311-1.
- the serverless function container 350 operates on the retrieved data using the UDFs that matched the detected event.
- the UDFs that matched the detected event can be a function to detect pneumonia from the X-ray, which can be used by the clinic, and a function to de-identify the uploaded X-ray image from the patient, which can be used by the researcher.
- the respective results from executing the UDF to detect pneumonia are automatically saved in an enriched X-ray data bucket 311-2, which can be accessed by the clinic.
- results can be one or more processed images after pneumonia risk detection.
- the respective results from executing the UDF to de-identify the uploaded X-ray image, which results can be one or more de-identified images, are automatically saved in a de-identified X-ray data bucket 311-3, which can be accessed by the researchers.
- with the NDP storage-side UDF support running, the entire process becomes a data pipeline that is secure, efficient, and fully automated.
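The fan-out in Figure 3, where one upload triggers two UDFs whose results land in separate buckets, can be sketched as below. Both UDF bodies are placeholders, the bucket names are illustrative labels for buckets 311-1 to 311-3, and the choice of which fields identify a patient is an assumption.

```python
def detect_pneumonia(xray: dict) -> dict:
    # Placeholder: a real UDF would run a trained model on the pixel data.
    return {**xray, "pneumonia_risk": "not evaluated (placeholder)"}


def de_identify(xray: dict) -> dict:
    # Drop the fields assumed here to identify the patient.
    return {k: v for k, v in xray.items() if k not in {"patient_name", "patient_id"}}


# UDF -> destination bucket, mirroring buckets 311-2 and 311-3
ROUTES = [
    (detect_pneumonia, "enriched-xray"),
    (de_identify, "de-identified-xray"),
]


def on_object_created(buckets: dict, key: str) -> None:
    """Handle an object-creation event on the raw X-ray bucket."""
    xray = buckets["raw-xray"][key]
    for udf, dest in ROUTES:
        buckets.setdefault(dest, {})[key] = udf(xray)
```

The raw object is left untouched, so the clinic and the researcher each see only the bucket intended for them.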
- the event registration mechanism can be implemented with a user filling in one simple configuration file with the storage-side UDF system automatically setting up the invocation process of UDFs in response to detecting presence of real-time events.
- registration of a UDF is conducted with the parameters that a put or copy event on a bucket labelled image triggers an invocation of a UDF. Also specified are parameters for operation of the UDF.
- the specified parameters include the preferred type of processor to use and the storage server of the storage system to use in executing the invoked UDF.
- the user prefers running on a graphics processing unit (GPU), but a CPU is also okay if there is no GPU available.
- the GPU hardware characteristic is associated with a weight of 80 in the range 1-100, which represents a level of priority in selecting storage-side instrumentalities in the storage system for running the UDF.
- the input parameters also include the preference of running on the storage server that controls this image, but this preference can be ignored in the selection process if storage servers are of equal distance to the data.
- the priority of 80 for using a GPU is higher than the priority of 50 for using a NDP.
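A registration of the kind described above might be expressed as the following structure. The source does not give the configuration file format, so every key name here is an assumption; only the bucket label, the put and copy events, and the weights 80 and 50 come from the description.

```python
# Hypothetical registration for one UDF; all key names are illustrative.
REGISTRATION = {
    "udf_name": "thumbnail",
    "trigger": {"bucket": "image", "events": ["put", "copy"]},
    "preferences": [
        {"name": "gpu", "weight": 80},               # prefer a GPU; a CPU is acceptable
        {"name": "ndp_local_server", "weight": 50},  # prefer the server that controls the image
    ],
}


def validate(registration: dict) -> list:
    """Check weights are in 1-100 and return preference names, highest weight first."""
    prefs = registration["preferences"]
    for pref in prefs:
        if not 1 <= pref["weight"] <= 100:
            raise ValueError(f"weight out of range 1-100: {pref}")
    ordered = sorted(prefs, key=lambda p: p["weight"], reverse=True)
    return [p["name"] for p in ordered]
```

Because the preferences are weights rather than requirements, a node without a GPU remains eligible and simply scores lower.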
- Event-driven invocation of a UDF can include, but is not limited to, invocation upon object download; invocation upon object upload; invocation upon object copy; invocation upon object access/get; invocation upon time; and invocation upon storage alerts. Events to optional storage information can also be configured.
- These data storage events can be identified in the storage system by extracting metadata from the data stream to the storage system for the data storage event.
- UDFs, which can be triggered by these events, are stored and registered in the storage system rather than a compute system.
- the UDFs can be written in any programming language, with any language version, and with any build system.
- Examples of such programming languages include, but are not limited to Java, Scala, and Python.
- Examples of such build systems include, but are not limited to, Maven and Gradle.
- All the UDF data can be stored in a hosted repository service, in a storage-side UDF system, for finding and sharing container images with automatic builds of container images.
- the UDF data can be stored in a conventional system, such as a Docker Hub, with proper user credentials.
- the metadata information such as UDF invocation condition can be stored in an “annotations” field of each UDF in the Docker Hub. More metadata fields can be added as needed to support more complex applications of a storage-side UDF system.
- a storage-side UDF system supports direct invocation of the storage-side serverless UDF.
- Amz S3 is a storage instrumentality for the Internet that has a simple web service interface to store and retrieve data at anytime from anywhere on the Internet.
- BucketKeyEnabled
  x-amz-request-payer: RequestPayer
  x-amz-tagging: Tagging
  x-amz-object-lock-mode: ObjectLockMode
  x-amz-object-lock-retain-until-date: ObjectLockRetainUntilDate
  x-amz-object-lock-legal-hold: ObjectLockLegalHoldStatus
  x-amz-expected-bucket-owner: ExpectedBucketOwner
  x-amz-meta-storage-side_udf_controller_udf_name: "storage-side_udf_controller-faas-spring-thumbnail"
  x-amz-meta-storage-side_udf_controller_input_parameters: [400, 600] #size of the thumbnail
  x-amz-meta-storage-
- the instructions for executing a storage-side UDF for generating thumbnails and placing the generated thumbnails in a thumbnails bucket are tacked on to the end of a section of the Amz S3 protocol.
- Storage systems in which a storage-side UDF controller is integrated can be provided with the option of ignoring these tacked-on sections to operate without the use of the storage-side UDF controller.
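How a storage-side controller might pick the UDF request out of the request headers can be sketched as follows. The header prefix is taken from the `x-amz-meta-` header names in the listing; the function name `extract_udf_request` is hypothetical, and treating the input parameters as a JSON-encoded list is an assumption.

```python
import json

# Prefix taken from the UDF-related header names in the listing.
UDF_PREFIX = "x-amz-meta-storage-side_udf_controller_"


def extract_udf_request(headers: dict):
    """Return (udf_name, input_parameters), or None if no UDF was requested."""
    fields = {
        name.lower()[len(UDF_PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(UDF_PREFIX)
    }
    if "udf_name" not in fields:
        return None  # plain storage request; handle as usual
    raw_params = fields.get("input_parameters", "[]")  # assumed JSON-encoded
    return fields["udf_name"], json.loads(raw_params)
```

A storage system without the controller never calls this function and so treats the extra headers as ordinary user metadata, which corresponds to the option of ignoring the tacked-on sections.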
- a storage-side UDF system can include one or more different and varied features.
- Such a storage-side UDF system can be implemented with a user device being able to issue a storage request, while a UDF operation is being performed.
- a user device can upload an image file to a storage bucket folder, while a request can be issued to run a storage-side UDF that can create a thumbnail from the newly uploaded image and store the thumbnail into a separate storage bucket.
- UDF information can be embedded in the metadata section for the thumbnail.
- a storage-side UDF controller can be implemented to determine into which storage node of the multiple storage nodes to deploy the serverless UDF. In making such a selection, a number of items can be taken into account.
- a user device can define metadata for the UDF.
- the user-defined metadata can include a parameter to indicate whether the given UDF, which is registered, uses special hardware like a GPU to run.
- the UDF service of the storage-side UDF system can analyze the following: UDF metadata information, such as, but not limited to, whether a GPU is preferred to run the UDF; event information, such as, but not limited to, the identity of one or more storage buckets and the objects mapped to these storage buckets; and storage system information, such as, but not limited to, which storage node controls or is close to the storage bucket containing the data and whether this storage node has the proper instrumentality, such as a GPU.
- the UDF service can send, to an orchestrator, the identified storage node, as the selected most appropriate storage node to run the UDF, and the corresponding storage information.
- the orchestrator can then deploy the UDFs to the selected storage node based on the selection information.
- the orchestrator can be implemented as a set of instructions, which can include operation as a serverless platform that can offer database and storage services.
- internal storage information can be used to influence the serverless UDF’s scheduling process.
- Storage awareness features related to automating deployment, scaling, and management of containerized applications can be implemented via a conventional advanced scheduling service by setting configuration and calling a runtime scheduler extender for serverless functions, for example running as pods, to deploy and run a matched storage-side UDF on the most desired storage node of the storage system.
- Figure 4 shows an example of a sequence of a UDF deployment with respect to node selection.
- the sequence can be performed in a storage system in an architecture similar or identical to architecture 100 of Figure 1, where the storage system includes multiple storage nodes 405-1 . . . 405-N.
- Storage node 405-1 can include a storage 410-1 having storage services 412-1 and a storage-side UDF controller 430-1.
- the storage-side UDF controller 430-1 can be implemented as a NDP server.
- the storage-side UDF controller 430-1 can include a NDP service 414-1, a UDF service 416-1, and an event listener service 419-1.
- the storage node 405-N can include a storage 410-N having storage services 412-N and a storage-side UDF controller 430-N.
- the storage-side UDF controller 430-N can be implemented as a NDP server.
- the storage-side UDF controller 430-N can include a NDP service 414-N, a UDF service 416-N, and an event listener service 419-N.
- A UDF orchestrator 440 and an event target cluster 420 can be implemented on the storage-side of a compute system-storage system architecture.
- the event target cluster 420 on the storage-side of the storage system calls back the event listener service 419-1 for notification.
- the event listener service 419-1 calls UDF service 416-1 for UDF invocations with storage bucket, objects, and UDF information.
- the UDF service 416-1 calls UDF registry 422 to find the matched UDFs and their metadata information.
- the UDF registry 422 can also function as an event target cluster or as a portion of an event target cluster similar to UDF registry 122 of the event target cluster 120 of Figure 1. If a match is found, then, at operation 4 of the sequence, the UDF service 416-1 calls NDP service 414-1 for storage information.
- the NDP service 414-1 calls storage system to get storage object location and resource allocation information via storage system SDK with the storage bucket, objects, and UDF information from the event listener service 419-1.
- the UDF service 416-1 can send the storage information to the orchestrator 440.
- the orchestrator 440 can deploy serverless functions to the optimal storage node. For example, the orchestrator 440 can deploy serverless function x 450-1 and serverless function y 450-2 to storage node 405-1, while deploying serverless function z 450-3 to storage node 405-N.
- Storage node selection for serverless UDFs can be realized using storage awareness features that can be implemented into container orchestration via a scheduling service by setting configuration files and calling a runtime scheduler for running serverless UDFs as pods.
- Container orchestration can involve automation of all aspects of coordinating and managing containers and can be focused on managing the life cycle of containers and their dynamic environments. There are a number of tasks that can be automated via container orchestration.
- An orchestrator can configure and schedule containers, provision and deploy containers, and manage availability of containers.
- An orchestrator can control configuration of applications in terms of the containers in which these applications run and can control scaling of containers to equally balance application workloads across an infrastructure.
- An orchestrator can manage allocation of resources between containers, manage load balancing, traffic routing and service discovery of containers, manage health monitoring of containers, and manage security of interactions between containers.
- An orchestrator can be implemented via instructions stored in a memory device and executed by one or more processors.
- the selection of a storage node from among multiple storage nodes of a storage system to execute a matched UDF can include multiple activities.
- the multiple activities of an example procedure can include a filter process, a scoring process, a select process based on the scoring results of the scoring process, and a binding process.
- the filter process can include using a filter that can label a node as a pod, and then filter out nodes that are not qualified.
- the filter can perform pod-node matching that checks conditions of the node and matching of preferences and requirements for the node, while considering preferences and instrumentality not to be included. The pod-node matching can be performed during scheduling rather than during UDF execution.
- the filter process can include pod matching to consider topology of appropriate nodes (pod affinity) and inappropriate nodes (pod anti-affinity).
- the filter process can include pod distribution that checks service affinity, which allows users to define an affinity or anti-affinity relationship of a service to another service.
- the filter process can include a hardware filter to identify whether a node has special hardware, for example, but not limited to, a GPU.
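The filter checks listed above can be approximated with a short sketch that removes unqualified nodes before any scoring takes place. The `Node` fields and the `filter_nodes` parameters are hypothetical, and only three of the described checks are modeled: node condition, the hardware filter, and a simple anti-affinity rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a storage node as seen by the scheduler."""
    name: str
    ready: bool = True
    has_gpu: bool = False
    running_udfs: frozenset = frozenset()


def filter_nodes(nodes, require_gpu=False, avoid_udfs=frozenset()):
    """Keep only the nodes that pass every hard check."""
    kept = []
    for node in nodes:
        if not node.ready:
            continue                      # node condition check
        if require_gpu and not node.has_gpu:
            continue                      # hardware filter
        if node.running_udfs & avoid_udfs:
            continue                      # anti-affinity: do not co-locate
        kept.append(node)
    return kept
```

Soft preferences, such as preferring but not requiring a GPU, are deliberately left to the scoring process rather than handled here.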
- the filter process can include a sampling procedure with respect to the nodes associated with the storage system.
- a configuration ratio, R, can be set to an amount less than the total number of nodes of the storage system. If a number of nodes of a cluster is N, then the number of nodes to find can be set to the maximum of (N*R, Min), where Min is a default minimum. The default minimum is less than the total number N.
- the configuration ratio R for a large number of nodes can be set to, but is not limited to, 10%. If the number of nodes N of a cluster, in this example, is 3,000, then the number of nodes to find can be set to the maximum of (3000*10/100, 100), which is 300, where 100 is the default minimum.
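The sampling rule max(N*R, Min) can be checked with a small worked sketch. The function name `nodes_to_find` is hypothetical, the ratio is passed as a whole-number percentage to keep the arithmetic exact, and capping the result at N for clusters smaller than the minimum is an assumption not stated in the source.

```python
def nodes_to_find(total_nodes: int, ratio_percent: int, minimum: int) -> int:
    """Number of nodes to sample: max(N * R, Min), capped at N."""
    wanted = max(total_nodes * ratio_percent // 100, minimum)
    return min(wanted, total_nodes)  # cap is an assumption for small clusters
```

For the example above, `nodes_to_find(3000, 10, 100)` evaluates max(300, 100) and returns 300, while a 500-node cluster with the same settings falls back to the minimum of 100.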
- Another example of finding a configuration number of nodes is to consider a range relative to a selected number.
- the number of nodes to find can be set to the maximum of (range - (total number of nodes in the cluster / the selected number)). In a non-limiting example, the number of nodes to find can be set to the maximum of ((5, 50) - (the number of nodes in the cluster) / 125). In other instances, all nodes may be considered in the filtering and scoring procedures.
- the scoring process can include assigning a weight for node preference in a pod configuration during scheduling.
- the weight can be a number assigned within a specified range.
- the score can be a number assigned in the range (1-100) for node preference in the pod configuration.
- the scheduler, which can be an orchestrator, can compute a sum for a given node by iterating through elements of preference fields for a storage-side UDF and adding a weight to the sum if preferences of the given node match corresponding elements in the iteration. This score is then combined with the scores of other priority UDFs for the node. A node or multiple nodes with the highest total score can be labeled as most preferred.
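The weighted-sum scoring described above can be sketched as follows. Nodes are represented by a set of labels and preferences by label and weight pairs; these representations and the function names are assumptions, and the combination with scores from other priority functions is left out.

```python
def score_node(node_labels: set, preferences: list) -> int:
    """Sum the weights of every preference the node satisfies."""
    total = 0
    for pref in preferences:
        if pref["label"] in node_labels:
            total += pref["weight"]
    return total


def most_preferred(nodes: dict, preferences: list) -> list:
    """Return the name(s) of the node(s) with the highest total score.

    `nodes` maps node name -> set of labels and must not be empty.
    """
    scores = {name: score_node(labels, preferences) for name, labels in nodes.items()}
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score == best)
```

With a GPU weight of 80 and a data-locality weight of 50, a node offering both scores 130 and is chosen over a node offering only one of them. Ties are returned together, matching the statement that multiple nodes can be labeled as most preferred.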
- An example first scenario can include a given UDF having a parameter set to a preference to run on a GPU and on the closest storage node, where the GPU preference has a higher score weight.
- An example second scenario can include a given UDF having a parameter set to node affinity, which is a preference to run on a storage node that is the closest node to the data on which the given UDF is to operate.
- An example third scenario can include a given UDF having a parameter set to a node affinity for requiring a GPU node.
- An example fourth scenario can include a first UDF and a second UDF having a parameter set to always run the first and second UDF in sequence, which can lead to a preference for the two UDFs to be co-deployed for the same storage node, which is an example of pod affinity.
- An example fifth scenario can include a first UDF and a second UDF having a parameter set to deploy the first UDF and the second UDF on different nodes based on both the first and second UDFs being CPU-intensive. Different pods can have different processing instruments. Each of these scenarios can include use of weights in the scoring process.
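The weighted preference sum used in the scoring process can be sketched as follows, using the first scenario (a GPU preference weighted higher than a closest-node preference). This is a minimal sketch: the `Preference` type, the label keys `accelerator` and `closest_to_data`, the node names, and the weights 80 and 40 are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Preference:
    label: str   # hypothetical node label key, e.g. "accelerator"
    value: str   # label value the UDF prefers, e.g. "gpu"
    weight: int  # weight assigned within a specified range, e.g. 1-100


def preference_score(node_labels: Dict[str, str],
                     preferences: List[Preference]) -> int:
    """Sum the weights of every preference that the node satisfies."""
    total = 0
    for pref in preferences:
        if node_labels.get(pref.label) == pref.value:
            total += pref.weight
    return total


def most_preferred(nodes: Dict[str, Dict[str, str]],
                   preferences: List[Preference]) -> List[str]:
    """Return the node name(s) with the highest total preference score."""
    scores = {name: preference_score(labels, preferences)
              for name, labels in nodes.items()}
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score == best)


# First scenario: prefer a GPU (higher weight) and the closest storage node.
prefs = [Preference("accelerator", "gpu", 80),
         Preference("closest_to_data", "true", 40)]
nodes = {
    "node-a": {"accelerator": "gpu", "closest_to_data": "false"},
    "node-b": {"accelerator": "none", "closest_to_data": "true"},
}
print(most_preferred(nodes, prefs))
```

In this sketch node-a scores 80 and node-b scores 40, so the GPU node is preferred even though it is not the closest node to the data, which reflects the higher weight given to the GPU preference.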
- the scoring process can include a scheduling algorithm related to node specific scoring for resource usage.
- resource usage scoring can include priorities associated with preferential distribution, preferential stacking, fragmentation rate, and node spread.
- an idle resource rate can be considered.
- the idle resource rate can be generated as a ratio of a difference between allocatable and request to the allocatable.
- Request indicates the resources that have been allocated to nodes.
- Allocatable indicates the resources that can be scheduled to nodes. The higher the idle resource rate, the higher is the score.
- a resource usage can be considered.
- the resource usage can be generated as a ratio of request to allocatable.
- Request indicates the resources that have been allocated to nodes.
- Allocatable indicates the resources that can be scheduled to nodes. The higher the resource usage, the higher is the score.
- for fragmentation rate, processor and memory resources can be considered.
- the fragmentation rate can be generated as an absolute value of a difference between CPU usage and memory usage.
- the CPU usage is taken as a function of a ratio of request to allocatable and the memory usage as a function of the ratio of request to allocatable.
- Request indicates the resources that have been allocated to nodes.
- Allocatable indicates the resources that can be scheduled to nodes. The higher the fragmentation rate, the lower is the score.
- for node spread, a number of pods in a container to deploy for the UDFs matched to detected storage events can be considered.
- the node spread can be generated as a ratio of a difference between the total number T of pods in a container and a statistical value N of nodes to the total number T. The higher the node spread, the higher is the score.
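The four resource usage quantities described above reduce to simple ratios, which the following sketch computes directly. The function and parameter names are chosen here for illustration, and the CPU and memory usage are taken as the plain request-to-allocatable ratio, which is one possible choice for the "function of a ratio" mentioned above. How each quantity maps onto a final score is not specified in the description and is left out.

```python
def _check(allocatable: float) -> None:
    if allocatable <= 0:
        raise ValueError("allocatable must be positive")


def idle_resource_rate(request: float, allocatable: float) -> float:
    """(allocatable - request) / allocatable; higher means a higher score."""
    _check(allocatable)
    return (allocatable - request) / allocatable


def resource_usage(request: float, allocatable: float) -> float:
    """request / allocatable; higher means a higher score when stacking."""
    _check(allocatable)
    return request / allocatable


def fragmentation_rate(cpu_request: float, cpu_allocatable: float,
                       mem_request: float, mem_allocatable: float) -> float:
    """|CPU usage - memory usage|; higher means a lower score."""
    return abs(resource_usage(cpu_request, cpu_allocatable)
               - resource_usage(mem_request, mem_allocatable))


def node_spread(total_pods: int, node_stat: int) -> float:
    """(T - N) / T, with T the total number of pods and N the node statistic."""
    if total_pods <= 0:
        raise ValueError("total_pods must be positive")
    return (total_pods - node_stat) / total_pods


# A node with 2 of 8 CPU units and 4 of 8 memory units already requested.
print(idle_resource_rate(2, 8), resource_usage(2, 8),
      fragmentation_rate(2, 8, 4, 8), node_spread(8, 2))
```

Note that the idle resource rate and the resource usage of the same resource always sum to one, so a scheduler would typically reward one or the other depending on whether preferential distribution or preferential stacking is configured.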
- one or more of the nodes is selected as a storage-side host to execute one or more storage-side UDFs. Once the storage-side host is selected, the information for the UDF execution can be bound to the selected storage-side host as a pod.
- Figure 5 illustrates an example of factors that influence a decision in a storage-side serverless scheduler 540 to select a storage node for UDF execution.
- the serverless scheduler 540 can be implemented as an orchestrator in an architecture similar to the architectures of Figures 1 and 4.
- the serverless scheduler 540 can receive UDF metadata information 562, storage node labels 564, runtime event information 566, and runtime storage system information 568.
- the UDF metadata information 562 can include UDF configuration data.
- the storage node labels 564 can include node configurations.
- the runtime event information 566 can include information on objects, buckets, and storage actions.
- the runtime storage system information 568 can include bucket and object owner information.
- the information provided to the scheduler 540 can be used in a node score calculation algorithm.
- a non-limiting example can include a score for each node generated as
- the scheduler 540 selects the node with the highest score to schedule the pod.
- the highest score can be a highest normalized score. If multiple nodes have the same highest score, then the scheduler 540 can pick one randomly. Alternatively, if multiple nodes have the same highest score, then the scheduler 540 can pick more than one randomly.
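The selection step, including the random choice among tied nodes, can be sketched as follows. The normalization shown here (scaling so that the best node receives 100) is an assumption, since the description does not define how scores are normalized, and the names `normalize` and `pick_node` are hypothetical.

```python
import random
from typing import Dict, Optional


def normalize(scores: Dict[str, float], top: float = 100.0) -> Dict[str, float]:
    """Scale raw scores so the best node receives `top` (assumed scheme)."""
    if not scores:
        raise ValueError("at least one node score is required")
    highest = max(scores.values())
    if highest <= 0:
        return {name: 0.0 for name in scores}
    return {name: score * top / highest for name, score in scores.items()}


def pick_node(scores: Dict[str, float],
              rng: Optional[random.Random] = None) -> str:
    """Return the node with the highest normalized score.

    When several nodes tie for the highest score, one of them is
    picked at random, as described for the scheduler.
    """
    rng = rng or random.Random()
    normalized = normalize(scores)
    best = max(normalized.values())
    tied = sorted(name for name, score in normalized.items() if score == best)
    return rng.choice(tied)


print(pick_node({"node-a": 10, "node-b": 30, "node-c": 20}))
```

Passing a seeded `random.Random` instance makes the tie-break reproducible, which can be convenient when testing a scheduler of this kind.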
- a method and a system for invoking storage-side UDFs are provided that are resilient and orchestrated based on storage events to form a fully-automated data pipeline.
- Methods can include registration of storage-side UDFs from a user device.
- An interface of UDF configurations can be introduced to include one or more trigger conditions, user preferences in storage resources, and SLA.
- an add-on metadata section can be added to the standard storage protocols in a defined and separable manner.
- Such methods for using storage-side UDFs can include storage and retrieval of UDFs from a storage-side UDF registry.
- a storage-side private UDF registry can be introduced to provide capability for extra security measures, especially in network security.
- Some embodiments providing enhanced security can include no internet exposure, sensitive data staying in the storage system, avoidance of attacks such as man-in-the-middle attacks, avoidance of UDF tampering, and other security features.
- storage-side execution of UDFs can avoid a common practice of using a cloud-based public registry that might expose security risks.
- Storage-side UDF registry can also provide better performance because some UDF data, such as images, can be large, and storage-side UDF registry can reduce external network transportation.
- Such methods for using storage-side UDFs can include the processing and analysis of event notifications, UDFs, and target storage objects.
- a mechanism can be provided to collect and analyze multi-dimensional metadata information from storage systems, storage objects, events, and UDFs. Because the services for the storage-side maintenance, storage, and execution of UDFs run on the storage systems, internal information of the storage system can be used, including information regarding internal storage caching, internal storage partitioning, sharding, and indexing, and an internal storage data protection scheme. With respect to internal storage caching, if the storage objects used in a UDF are already in cache and validated, accessing disks and drives to obtain them again can be avoided, which can save storage input/output (I/O) and shorten response time.
- Such methods can include selecting one or more storage nodes for UDF deployment and invocation.
- the UDF can operate on images or big data.
- An orchestration extension can be provided to explicitly send internal analyzed storage information to influence the serverless UDF’s scheduling process.
- Storage awareness features can be implemented via the advanced scheduling service of this feature by setting configuration and calling the runtime scheduler extender for serverless functions, running as pods, to deploy and run UDF on the most desired storage node.
- a system for invoking storage-side UDFs which is resilient and can be orchestrated based on storage events to form a fully-automated data pipeline, can be realized by a single storage system or multiple distributed storage systems.
- Each storage system can include multiple resilient storage nodes of any type of storage systems, where each storage node can contain software package plugins to support storage-side serverless UDFs.
- Such software package plugins can include instructions, executable by a processor, for receiving registration of storage-side UDFs via REST services.
- Such software package plugins can include instructions, executable by a processor, for serving requests to retrieve runtime information for the storage system including but not restricted to internal and public information of the storage system.
- internal and public information can include, but is not limited to, storage locality, storage system resources, caching, indexing, data protection, service oriented architecture (SOA) policy, and other relevant data associated with the storage system.
- Such software package plugins can include instructions, executable by a processor, for processing requests that invoke storage-side UDFs that operate on storage objects.
- the instructions can include storing result objects, from the UDF operations on the storage objects, in memory or storage media via storage processors or storage nodes.
- FIG. 6 is a flow diagram of features of an embodiment of an example method 600 of storage-side computation.
- the method 600 can be performed using a processing module in a computer-implemented method, where the processing module has one or more storage processors executing stored instructions.
- the one or more storage processors can be structured having tasks to store, retrieve, and protect data.
- a data storage event is detected in a storage system, where the data storage event was initiated from exterior to the storage system.
- the storage system can be arranged in an architecture of a compute system separated from the storage system. In general operation, the compute system can generate data and interface with a user device to store and retrieve data from the storage system.
- the storage system can be located locally with the compute system or remotely from the compute system.
- the storage system can have one or more storage nodes, where each storage node has memory and one or more storage processors.
- the storage system can also contain data storage equipment in the form of disk drives, where the disk drives are either external to the storage nodes or as part of the storage nodes.
- Each storage node of the storage system can control a certain portion of data storage, and can manage operations on data including operations to store, delete, access, copy, and perform other data operations. These operations can be performed with respect to storage objects on the disk drives associated with the respective storage node.
- Notification of data storage events in a storage node can be conducted via a storage processor for the storage node, where the storage processor or associated memory or circuitry has logic to monitor these events.
- the logic can monitor request and command structures received in the storage node along with monitoring metadata of the request and command structures and data of the data storage event.
- the notification can be made to a control module in the storage system, where the control module includes one or more processors to execute stored instructions to perform as an orchestrator for the storage nodes of the storage system.
- the notification can also be conveyed to the compute system of the architecture.
- Extracting the metadata from the data storage event can include electronically reading requests, commands, or data received in the storage system from exterior to the storage system.
- the source of the metadata of the data storage event can include one or more of a client user device and one or more compute systems.
- in response to detecting the data storage event in the storage system and after completing the data storage event, a UDF is automatically invoked directly within the storage system, based on the metadata extracted, where the UDF resides within the storage system.
- Automatically invoking the UDF can include initiating the invocation with notifying services within the storage system.
- the UDF can automatically run directly within the storage system.
- a result of operation of the UDF in the storage system can be stored in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
- the result may be provided to a user device via a compute system associated with the storage system upon determination of the result, depending on the parameters of the UDF, or at some later time in response to a request for the result.
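The flow of method 600 (detect the event, extract its metadata, complete the storage operation, invoke a matching UDF inside the storage system, and keep the result in the storage system) can be sketched with an in-memory stand-in for a storage node. The classes `StorageEvent` and `StorageNode`, the `.name` suffix used for result objects, and the example UDF are hypothetical and serve only to show the ordering of the steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StorageEvent:
    action: str     # storage action, e.g. "put"
    bucket: str
    key: str
    payload: bytes


@dataclass
class StorageNode:
    """In-memory stand-in for a storage node that hosts UDFs."""
    objects: Dict[Tuple[str, str], bytes] = field(default_factory=dict)
    udfs: List[tuple] = field(default_factory=list)

    def register(self, name: str,
                 matches: Callable[[Dict[str, str]], bool],
                 udf: Callable[[bytes], bytes]) -> None:
        self.udfs.append((name, matches, udf))

    def handle(self, event: StorageEvent) -> List[str]:
        # 1. Complete the data storage event itself.
        self.objects[(event.bucket, event.key)] = event.payload
        # 2. Extract metadata from the event.
        metadata = {"action": event.action,
                    "bucket": event.bucket,
                    "key": event.key}
        # 3. Invoke every matching UDF inside the storage node and
        #    store its result as a new object, without returning the
        #    result data to the caller that initiated the event.
        stored = []
        for name, matches, udf in self.udfs:
            if matches(metadata):
                result_key = f"{event.key}.{name}"
                self.objects[(event.bucket, result_key)] = udf(event.payload)
                stored.append(result_key)
        return stored


node = StorageNode()
node.register("upper", lambda m: m["bucket"] == "logs", bytes.upper)
print(node.handle(StorageEvent("put", "logs", "a.txt", b"hello")))
```

Only the keys of the stored result objects are returned in this sketch; the result data stays in the node and could be fetched later through an ordinary read request.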
- Variations of method 600 or methods similar to the method 600 can include a number of different embodiments that may be combined depending on the application of such methods and/or the architecture of devices or systems in which such methods are implemented.
- Such methods can include registering, in the storage system, the UDF prior to detecting the data storage event.
- the UDF can be registered in the storage system prior to automatically invoking and running the UDF.
- Registering the UDF can include registering, for the UDF, data for matching to one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or a SLA.
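A registration record holding the items listed above (matching data, a trigger condition, user preferences for storage resources, and an SLA) might look like the following sketch. All field names, the representation of the SLA as a maximum latency in milliseconds, and the `UdfRegistry` class are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class UdfRegistration:
    name: str
    trigger: str                                  # storage action that triggers the UDF
    match: Dict[str, str] = field(default_factory=dict)        # metadata to match
    preferences: Dict[str, int] = field(default_factory=dict)  # resource preference weights
    sla_max_latency_ms: Optional[int] = None      # assumed SLA representation


class UdfRegistry:
    """Keeps registrations and finds the UDFs matching an event."""

    def __init__(self) -> None:
        self._entries: Dict[str, UdfRegistration] = {}

    def register(self, registration: UdfRegistration) -> None:
        if registration.name in self._entries:
            raise ValueError(f"UDF {registration.name!r} is already registered")
        self._entries[registration.name] = registration

    def matching(self, event_metadata: Dict[str, str]) -> List[str]:
        """Return names of UDFs whose trigger and match data fit the event."""
        names = []
        for reg in self._entries.values():
            if reg.trigger != event_metadata.get("action"):
                continue
            if all(event_metadata.get(k) == v for k, v in reg.match.items()):
                names.append(reg.name)
        return sorted(names)


registry = UdfRegistry()
registry.register(UdfRegistration("thumbnail", "put",
                                  match={"bucket": "images"},
                                  preferences={"gpu": 80},
                                  sla_max_latency_ms=500))
print(registry.matching({"action": "put", "bucket": "images", "key": "a.png"}))
```

Because the UDF is registered before any event is detected, the lookup at event time is a simple comparison of the extracted metadata against the stored trigger and matching data.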
- Variations of method 600 or methods similar to the method 600 can include storing UDFs and parameters for the UDFs in a UDF registry storage in the storage system.
- the UDF registry storage can be implemented as a specific storage volume in the storage system.
- each storage node can have a UDF registry storage.
- the UDF registry storage of each storage node can be arranged to store the same UDF or UDF information or one or more of the storage nodes can be arranged to store specific UDF or UDF information that can be different than that stored on other storage nodes of the storage system.
- the storage system, on which the UDF can be run can be a storage network that connects and uses multiple storage systems.
- the multiple storage systems can be structured in a data center.
- Variations of method 600 or methods similar to the method 600 can include providing security measures, specific to the UDFs, in storing the UDFs in the UDF registry or in retrieving the UDFs from the UDF registry. Variations can include, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more UDFs, or one or more target storage objects.
- Variations can include performing operations with the storage system structured as a single node with multiple storage nodes or as multiple storage systems, where each of the multiple storage systems can have one or more storage nodes. Variations can include scheduling operation of the UDF and selecting one storage node of multiple storage nodes in the storage system on which to run the UDF. To choose a node, a node score can be generated to select the one storage node using one or more of metadata of the UDF, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
- the detecting of the data storage event and the automatic invoking of the UDF can be performed directly within the storage system in one node of multiple storage nodes of the storage system.
- Performing the detecting of the data storage event and the automatic invoking and running of the UDF directly within the storage system can include using a storage-side protocol of automatic invoking and running of the UDF directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
- a non-transitory machine-readable storage device such as computer-readable non-transitory medium
- the physical structures of such instructions may be operated on by one or more storage processors.
- executing these physical structures can cause the machine to perform operations comprising detecting a data storage event in a storage system, the data storage event initiated from exterior to the storage system; extracting metadata from the data storage event detected; and in response to detecting the data storage event in the storage system and after completing the data storage event, automatically invoking a UDF directly within the storage system, based on the metadata extracted, with the UDF residing within the storage system.
- Operations can include storing, in the storage system, a result of operation of the UDF in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
- the result may be provided to a user device via a compute system associated with the storage system upon determination of the result, depending on the parameters of the UDF, or at some later time in response to a request for the result.
- Operations executed by the one or more processors can include registering, in the storage system, the UDF prior to detecting the data storage event. Registering the UDF can include registering, for the UDF, data for matching to one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or a SLA. Operations executed by the one or more processors can include storing UDFs and parameters for the UDFs in a UDF registry storage in the storage system.
- Operations executed by the one or more processors of the storage system can include providing security measures, specific to the UDFs, in storing the UDFs in the UDF registry or in retrieving the UDFs from the UDF registry. Operations can include, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more UDFs, or one or more target storage objects.
- Operations executed by the one or more processors of the storage system can include scheduling an operation of the UDF and selecting one storage node of multiple storage nodes in the storage system on which to run the UDF. Operations can include generating a node score to select the one storage node using one or more of metadata of the UDF, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
- Operations can include performing the detecting of the data storage event and the automatic invoking of the UDF directly within the storage system in one node of multiple storage nodes of the storage system. Operations can include performing the detecting of the data storage event and the automatic invoking of the UDF directly within the storage system including using a storage-side protocol of automatic invoking of the UDF directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
- a storage system can comprise a memory storing instructions and one or more processors in communication with the memory, where the one or more storage processors execute the instructions.
- the instructions include instructions to detect a data storage event in the storage system, where the data storage event is initiated from exterior to the storage system, and extract metadata from the data storage event detected.
- a UDF is automatically invoked directly within the storage system, based on the metadata extracted, with the UDF residing within the storage system.
- the one or more storage processors can be structured to be operable to store, in the storage system, a result of operation of the UDF in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
- the storage system can be arranged in an architecture of a compute system separated from the storage system.
- the compute system can generate data and interface with a user device to store and retrieve data from the storage system.
- the storage system can be located locally with the compute system or remotely from the compute system. Communication between the compute system and the storage system can be implemented over local communication and interface instrumentalities or over a communication network.
- the storage system can have one or more storage nodes, where each storage node has memory and one or more storage processors.
- the storage system can also contain data storage equipment in the form of disk drives, where the disk drives are either external to the storage nodes or as part of the storage nodes.
- Each storage node of the storage system can control a certain portion of data storage, and can manage operations on data including operations to store, delete, access, copy, and perform other data operations. These operations can be performed with respect to storage objects on the disk drives associated with the respective storage node.
- Notification of events in a storage node can be conducted via a storage processor for the storage node, where the storage processor or associated memory or circuitry has logic to monitor these events. The notification can be made to a control module in the storage system, where the control module includes one or more processors and stored instructions to perform as an orchestrator for the storage nodes of the storage system.
- the notification can be conveyed to the compute system of the architecture.
- Variations of such a storage system or similar systems can include the one or more storage processors structured to be operable to execute stored instructions to register the UDF in the storage system prior to detection of the data storage event.
- the registration of the UDF can include registration of one or more parameters for the UDF including data for matching to the metadata from the data storage event detected, a trigger condition to respond to the detection of the storage event, one or more user preferences for use of storage resources of the storage system, or a SLA.
- Such a storage system can include a UDF registry that stores UDFs and parameters for the UDFs.
- Variations of such a storage system or similar systems can include the one or more storage processors structured to be operable to provide security measures, specific to the UDFs, to storage of the UDFs in the UDF registry or to retrieval of the UDFs from the UDF registry.
- the one or more storage processors can be structured to be operable to execute the instructions to, based on the detection of the storage event in the storage system, process and analyze an event notification, one or more UDFs, or one or more target storage objects.
- Variations of such a storage system or similar systems can include the one or more storage processors structured to be operable to execute the instructions to orchestrate scheduling an operation of the UDF and select one storage node of multiple storage nodes in the storage system on which to run the UDF.
- the one or more storage processors can be operable to execute the instructions to generate a node score to select the one storage node using one or more of metadata of the UDF, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
- Variations of such a storage system or similar systems can include multiple nodes with each node having a storage processor operable to detect a specific data storage event for the node and to execute a specific UDF in the node in response to detection of the specific data storage event in the node.
- the instructions to automatically invoke a UDF directly within the storage system in response to detection of a data storage event in the storage system are portable to different types of storage systems using different storage protocols and capable of operation with one or more protocols that perform storage operations in the storage system.
- FIG. 7 is a block diagram illustrating components of an example system 700 that can implement algorithms and perform methods structured to conduct real-time event-driven serverless functions within storage systems, as taught herein.
- the system 700 can be implemented in a compute system - storage system architecture.
- the system 700 can include one or more processors 750 that can be structured to execute stored instructions to perform functions of a storage system having a source-side UDF controller for performing real-time event-driven serverless functions within the storage system.
- the one or more processors 750 can be storage processors.
- the source-side UDF controller can be implemented as one or more processors and memory with stored instructions for automatically executing UDFs within the storage system as taught herein, with the source-side UDF controller in communication with one or more servers within the storage system.
- a source-side UDF controller can be implemented as instructions in a number of storage servers of the storage system to automatically execute UDFs within the storage system as taught herein.
- the source-side UDF controller can be implemented as instructions in each storage server of the storage system.
- the source-side UDF controller implemented as instructions in the storage system can be realized with plug-in software.
- the one or more processors 750 can be realized by hardware processors.
- the system 700 may operate as a standalone system or may be connected, for example networked, to other systems. In a networked deployment, the system 700 may operate in the capacity of a server machine, a client machine, or both in server-client network environments. In an example, the system 700 may act as a peer machine in peer-to-peer (P2P) (or other distributed) network environment. Further, while only a single system is illustrated, the term “system” shall also be taken to include any collection of systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as cloud computing, software as a service (SaaS), or other computing cluster configurations.
- the example system 700 can be arranged to operate with one or more other devices structured to perform real-time event-driven serverless user-defined functions within a storage system as taught herein.
- the system 700 can include a main memory 754, and a static memory 756, some or all of which may communicate with each other via a communication link 758.
- the communication link (e.g., bus) 758 can be implemented as a bus, a local link, a network, other communication path, or combinations thereof.
- the system 700 may further include a display device 760, an input device 762 (e.g., a keyboard), a user interface (UI) navigation device 764 (e.g., a mouse), and a signal generation device 768 (e.g., a speaker).
- the display device 760, input device 762, and UI navigation device 764 can be a touch screen display.
- the system 700 can include an output controller 769, such as a serial (e.g., USB), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate with or control one or more peripheral devices (e.g., a printer, card reader, etc.).
- the system 700 can include one or more sensors 766 as IoT clients on the compute-side of the system.
- the display device 760, the input device 762, the UI navigation device 764, the signal generation device 768, the output controller 769, and the sensors 766 can be structured as part of a compute system in a compute system - storage system architecture for the system 700.
- the system 700 can include a machine-readable medium 752 on which is stored one or more sets of data structures or instructions 755 (e.g., software or data) embodying or utilized by the system 700 to perform any one or more of the techniques or functions for which the system 700 is designed, including controlling, storing, and automatically executing storage-side UDFs with the storage system, where the storage-side UDFs can be registered with respect to one or more storage events.
- the instructions 755 or other data stored on the machine-readable medium 752 can be accessed by the main memory 754 for use by the one or more processors 750.
- the instructions 755 may also reside, completely or at least partially, within the main memory 754, within the static memory 756, within a mass storage 751, or within the one or more processors 750.
- While the machine-readable medium 752 is illustrated as a single medium, the term “machine-readable medium” can include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) configured to store the instructions 755 or data.
- the term “machine-readable medium” can include any medium that is capable of storing, encoding, or carrying instructions for execution by the system 700 and that cause the system 700 to perform any one or more of the techniques for which the system 700 is designed, or that is capable of storing, encoding or carrying data structures used by or associated with such instructions.
- Non-limiting machine-readable medium examples can include solid-state memories, optical media, and magnetic media.
- the data from or stored in machine-readable medium 752 or main memory 754 can be transmitted or received over a communications network 759 using a transmission medium via a network interface device 753 utilizing any one of a number of transfer protocols (e.g., frame relay, Internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.).
- Example communication networks can include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards known as Wi-Fi®, IEEE 802.16 family of standards known as WiMax®), IEEE 802.15.4 family of standards, peer-to-peer (P2P) networks, among others.
- the network interface device 753 can include one or more physical jacks (e.g., Ethernet, coaxial, or phone jacks) or one or more antennas to connect to the communications network.
- the network interface device 753 can include a plurality of antennas to wirelessly communicate using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques.
- the term “transmission medium” shall be taken to include any tangible medium that is capable of carrying instructions to and for execution by the system 700, and includes instrumentalities to propagate digital or analog communications signals to facilitate communication of such instructions, which instructions may be implemented by software.
- the network interface device 753 can operate in conjunction with the network 759 to communicate between a storage system or components of the storage system and a compute system or components of the compute system in a compute system - storage system architecture.
- the system 700 can be implemented in a cloud environment.
- the components of the illustrative devices, systems, and methods employed in accordance with the illustrated embodiments can be implemented, at least in part, in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. These components can be implemented, for example, as a computer program product such as a computer program, program code or computer instructions tangibly embodied in a machine-readable storage device, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers.
- DSP digital signal processor
- ASIC application-specific integrated circuit
- FPGA field-programmable gate array
- a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
- a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random-access memory or both.
- the elements of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- the processor and the memory can be supplemented by, or incorporated in special purpose logic circuitry.
- An architecture and protocol is provided that can allow a user to define UDFs in the storage-side of a compute-side-storage-side architecture.
- This architecture and protocol can provide a fully automated real-time event-driven data pipeline with efficient data processing, extra security, and lower operational cost.
- the event-driven and storage-aware serverless function architecture, protocol, and related techniques can be implemented to attain a fully-automated execution plan that is most appropriate for storage-side serverless functions, based on storage resource allocations.
- Such an architecture and protocol, or a similar architecture and protocol, can reduce network I/O, take full advantage of storage resources, speed up overall processing time, reduce operational cost, and improve data security.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An efficient structure and methodology are provided to automatically run a user-defined function directly within a storage system in response to detection of a storage event in the storage system, where the user-defined function resides within the storage system. A user-defined function controller can be implemented as a storage-system plugin architecture and protocol that can be integrated and deployed into storage systems and cloud storage backends. The architecture and protocol can allow users to write their own storage functions and to register different types of events to storage systems. In various embodiments, a user-defined function controller can respond to an event notification and can automatically invoke serverless function deployment to support a fully automated data pipeline. The serverless deployment can be storage aware via orchestration to select appropriate storage nodes to deploy and run serverless user-defined functions based on real-time resources and special hardware.
Description
REAL-TIME EVENT-DRIVEN SERVERLESS FUNCTIONS WITHIN STORAGE SYSTEMS FOR NEAR DATA PROCESSING
TECHNICAL FIELD
[0001] The present disclosure is related to system data storage and, in particular, to providing operations of data-related functions in a data storage system.
BACKGROUND
[0002] Modern artificial intelligence (AI) and analytics consume a tremendous amount of data, which is stored in on-premise storage systems or cloud storage systems. Normally, the data has to be transferred from the storage side to the computation side for computation, which causes huge network traffic and other inefficiencies. Many examples in different industries demonstrate this problem. For example, health care operations typically generate a significant amount of data such as raw medical images including x-rays, cell tissue pathology, CT scans, MRI images, and other image data. The generated data can include large objects, for example as large as 100 GB per image, created by various devices. These raw images are processed to generate useful metadata such as disease auto detection and prediction and de-identification. De-identification is performed to protect the privacy of patients about whom the data is collected. Common scenarios for use of the collected data include doctors who use information regarding auto detection results per patient and researchers who use images with similar features from as many patients as possible, where the images are separate from the identity of the patients (de-identified).
[0003] A conventional process, performed by hospital assistants or data scientists, includes manually downloading images from a storage infrastructure to one or more compute servers that can operate on the downloaded images. A compute system, such as compute servers, performs computations on data and can store the results of the computation in a storage system. Image detection or prediction modules or procedures can be applied to evaluate risk followed by application of de-identification modules or procedures to comply with governmental requirements such as the Health Insurance Portability and Accountability Act (HIPAA). The de-identified data can be sent to the researchers, and detection results can be sent to physicians. This large set of data is used for multiple tasks, where each task may only use a portion of the total data. The partial use of a large integrated amount of data is not limited to the health care industry, but is applicable to other industries that use big data.
SUMMARY
[0004] It is an object of various embodiments to provide an efficient architecture and methodology to automatically run a user-defined function (UDF) directly within a storage system in response to detection of a storage event in the storage system, where the UDF resides within the storage system. Execution of UDFs on a storage system can enhance performance by reducing transportation of a huge amount of data between compute systems and storage systems. Detection of a storage event by the storage system can allow full automation driven by the storage event-driven data pipeline. A UDF controller can be implemented as a storage-system plugin architecture and protocol that can be integrated and deployed into storage systems and cloud storage backends.
The architecture and protocol can allow users to write their own storage-side UDFs and to register different types of events to storage systems. A UDF controller can respond to an event notification and can automatically invoke serverless function deployment to support a fully automated data pipeline. The serverless deployment can be storage aware via orchestration to select appropriate storage nodes to deploy serverless UDFs based on real-time resources and special hardware. Storage-side serverless UDFs with full orchestration can be run on the storage side in a compute system-storage system architecture with near data processing.
[0005] According to a first aspect of the present disclosure, there is provided a storage system, the storage system comprising a memory storing instructions and one or more storage processors in communication with the memory, wherein the one or more storage processors execute the instructions. The one or more storage processors execute the instructions to detect a data storage event in the storage system, where the data storage event is initiated from exterior to the storage system. Metadata is extracted from the data storage event detected. In response to detection of the data storage event in the storage system and after completion of the data storage event, the one or more storage processors execute instructions to automatically invoke a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
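The sequence described in the first aspect (detect a completed storage event, extract metadata, then invoke a matching UDF inside the storage system) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `StorageEvent`, `extract_metadata`, and `UdfController`, and the choice of operation type and object-key suffix as the extracted metadata, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StorageEvent:
    """Hypothetical record of a completed, externally initiated storage operation."""
    operation: str    # e.g. "PUT"
    object_key: str   # e.g. "scans/patient-17.dcm"
    size_bytes: int


def extract_metadata(event: StorageEvent) -> Dict[str, str]:
    """Assumed metadata: the operation type and the object-key suffix."""
    suffix = event.object_key.rsplit(".", 1)[-1] if "." in event.object_key else ""
    return {"operation": event.operation, "suffix": suffix}


class UdfController:
    """Invokes registered UDFs whose match data agrees with the event metadata."""

    def __init__(self) -> None:
        self._triggers: List[Tuple[Dict[str, str], Callable[[StorageEvent], str]]] = []

    def register(self, match: Dict[str, str], udf: Callable[[StorageEvent], str]) -> None:
        # Registration happens before any event is detected.
        self._triggers.append((match, udf))

    def on_event_completed(self, event: StorageEvent) -> List[str]:
        # Called only after the storage operation has finished.
        metadata = extract_metadata(event)
        results = []
        for match, udf in self._triggers:
            if all(metadata.get(key) == value for key, value in match.items()):
                results.append(udf(event))  # runs storage-side, next to the data
        return results
```

In this sketch a UDF registered with the match data `{"operation": "PUT", "suffix": "dcm"}` would run for every completed upload of a `.dcm` object and would be skipped for other objects.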
[0006] In a first implementation form of the storage system according to the first aspect as such, the one or more storage processors are operable to store, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
[0007] In a second implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the one or more storage processors are operable to execute stored instructions to register the user-defined function in the storage system prior to detection of the data storage event.
[0008] In a third implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, registration of the user-defined function includes registration of one or more parameters for the user-defined function including data for matching to the metadata from the data storage event detected, a trigger condition to respond to the detection of the storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
[0009] In a fourth implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the storage system includes a user-defined function registry that stores user-defined functions and parameters for the user-defined functions.
[0010] In a fifth implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the one or more storage processors are operable to provide security measures, specific to the user-defined functions, to storage of the user-defined functions in the user-defined function registry or to retrieval of the user-defined functions from the user-defined function registry.
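One way to picture the user-defined function registry and its registration parameters is the sketch below. It is illustrative only: the field names, the `UdfRegistry` class, and the use of a SHA-256 digest as the UDF-specific security measure on storage and retrieval are assumptions, since the disclosure does not specify a particular mechanism.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class UdfRegistration:
    """Hypothetical registry entry holding a UDF and its registration parameters."""
    name: str
    source: str                          # UDF body, kept as text in this sketch
    match_metadata: Dict[str, str]       # data for matching to event metadata
    trigger_condition: str               # e.g. "on_put_complete" (assumed label)
    resource_preferences: Dict[str, str] = field(default_factory=dict)
    sla_tier: Optional[str] = None       # service-level agreement, if any
    digest: str = ""                     # filled in by the registry on store


def _digest(source: str) -> str:
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


class UdfRegistry:
    """Stores UDFs and verifies their integrity on retrieval (assumed measure)."""

    def __init__(self) -> None:
        self._entries: Dict[str, UdfRegistration] = {}

    def store(self, registration: UdfRegistration) -> None:
        registration.digest = _digest(registration.source)
        self._entries[registration.name] = registration

    def retrieve(self, name: str) -> UdfRegistration:
        registration = self._entries[name]
        if _digest(registration.source) != registration.digest:
            raise ValueError(f"UDF {name!r} failed its integrity check")
        return registration
```

A production registry would more plausibly rely on signatures and access control; the digest check here only shows where such a measure would sit in the store and retrieve paths.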
[0011] In a sixth implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the one or more storage processors are operable to execute the instructions to process and analyze an event notification, one or more user-defined functions, or one or more target storage objects, based on the detection of the data storage event in the storage system.
[0012] In a seventh implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the one or more storage processors are operable to execute the instructions to orchestrate scheduling operation of the user-defined function and select one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
[0013] In an eighth implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the one or more storage processors are operable to execute the instructions to generate a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
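Node selection by score could look like the following sketch. The inputs used here (free CPU fraction, presence of special hardware, and whether the node already holds the target object) and the weights are assumptions chosen for illustration; the disclosure names only the broader categories of UDF metadata, node configurations, runtime event information, and runtime storage system information.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeState:
    """Hypothetical snapshot of one storage node at scheduling time."""
    name: str
    free_cpu: float             # fraction of CPU currently idle, 0.0 to 1.0
    has_gpu: bool               # example of special hardware
    holds_target_object: bool   # data locality for the event's target object


def node_score(node: NodeState, needs_gpu: bool, weights: Dict[str, float]) -> float:
    """Higher is better; a node lacking required hardware is excluded."""
    if needs_gpu and not node.has_gpu:
        return float("-inf")
    locality = 1.0 if node.holds_target_object else 0.0
    return weights["cpu"] * node.free_cpu + weights["locality"] * locality


def select_node(nodes: List[NodeState], needs_gpu: bool,
                weights: Optional[Dict[str, float]] = None) -> NodeState:
    """Pick the single highest-scoring node on which to run the UDF."""
    weights = weights or {"cpu": 1.0, "locality": 2.0}
    scored = [(node_score(node, needs_gpu, weights), node) for node in nodes]
    best_score, best_node = max(scored, key=lambda pair: pair[0])
    if best_score == float("-inf"):
        raise RuntimeError("no storage node satisfies the UDF's hardware needs")
    return best_node
```

Weighting locality above idle CPU reflects the near-data-processing goal: running the UDF where the object already resides avoids moving it between nodes.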
[0014] In a ninth implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the storage system includes multiple nodes with each node having a storage processor operable to detect a specific data storage event for the node and execute a specific user-defined function in the node in response to detection of the specific data storage event in the node.
[0015] In a tenth implementation form of the storage system according to the first aspect as such or any preceding implementation form of the first aspect, the instructions to automatically invoke a user-defined function directly within the storage system in response to detection of a data storage event in the storage system are portable to different types of storage systems using different storage protocols and capable of operation with one or more protocols that perform storage operations in the storage system.
[0016] According to a second aspect of the present disclosure, there is provided a computer-implemented method of storage-side computation. The computer-implemented method comprises detecting a data storage event in a storage system, where the data storage event is initiated from exterior to the storage system and extracting metadata from the data storage event detected.
The computer-implemented method comprises, in response to detecting the data storage event in the storage system and after completing the data storage event, automatically invoking a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
[0017] In a first implementation form of the computer-implemented method of storage-side computation according to the second aspect as such, the computer-implemented method includes storing, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
[0018] In a second implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, the computer-implemented method includes registering, in the storage system, the user-defined function prior to detecting the data storage event.
[0019] In a third implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, registering the user-defined function includes registering, for the user-defined function, data for matching to one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
[0020] In a fourth implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, the computer-implemented method includes storing user-defined functions and parameters for the user-defined functions in a user-defined function registry storage in the storage system.
[0021] In a fifth implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, the computer-implemented method includes providing security measures, specific to the user-defined functions, in storing the user-defined functions in the user-defined function registry or in retrieving the user-defined functions from the user-defined function registry.
[0022] In a sixth implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, the computer-implemented method includes, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more user-defined functions, or one or more target storage objects.
[0023] In a seventh implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, the computer-implemented method includes scheduling operation of the user-defined function and selecting one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
[0024] In an eighth implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, the computer-implemented method includes generating a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
[0025] In a ninth implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, the computer-implemented method includes performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system in one node of multiple storage nodes of the storage system.
[0026] In a tenth implementation form of the computer-implemented method of storage-side computation according to the second aspect as such or any preceding implementation form of the second aspect, performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system includes using a storage-side protocol of automatic invoking of the user-defined function directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
[0027] According to a third aspect of the present disclosure, there is provided a non-transitory computer-readable medium storing instructions for storage-side computation, which, when executed by one or more storage processors, cause the one or more processors to perform operations. The operations comprise detecting a data storage event in a storage system, the data storage event initiated from exterior to the storage system; extracting metadata from the data storage event detected; and in response to detecting the data storage event in the storage system and after completing the data storage event, automatically invoking a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
[0028] In a first implementation form of the non-transitory computer-readable medium according to the third aspect as such, the operations include storing, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
[0029] In a second implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include registering, in the storage system, the user-defined function prior to detecting the data storage event.
[0030] In a third implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include registering the user-defined function including registering, for the user-defined function, one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
[0031] In a fourth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include storing user-defined functions and parameters for the user-defined functions in a user-defined function registry storage in the storage system.
[0032] In a fifth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include providing security measures, specific to the user-defined functions, in storing the user-defined functions in the user-defined function registry or in retrieving the user-defined functions from the user-defined function registry.
[0033] In a sixth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more user-defined functions, or one or more target storage objects.
[0034] In a seventh implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include scheduling operation of the user-defined function and selecting one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
[0035] In an eighth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include generating a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
[0036] In a ninth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system in one node of multiple storage nodes of the storage system.
[0037] In a tenth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, performing the detecting of the storage event and the automatic invoking of the user-defined function directly within the storage system includes using a storage-side protocol of automatic invoking of the user-defined function directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
[0038] Any one of the foregoing examples may be combined with any one or more of the other foregoing examples to create a new embodiment in accordance with the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.
[0040] Figure 1 is an illustration of an example architecture for a storage-side user-defined function for which a storage system interfaces with a compute system, according to various embodiments.
[0041] Figure 2 illustrates features of an example event-driven user-defined function invocation sequence flow in the architecture of Figure 1, according to various embodiments.
[0042] Figure 3 illustrates an example automatic data pipeline for providing access to health data by different users using a storage-side user-defined function system similar to the architecture of Figure 2, according to various embodiments.
[0043] Figure 4 shows an example of a sequence of user-defined function deployment with respect to node selection, according to various embodiments.
[0044] Figure 5 illustrates an example of factors that influence a decision in a storage-side serverless scheduler to select a storage node for user-defined function execution, according to various embodiments.
[0045] Figure 6 is a flow diagram of features of an example method of storage-side computation, according to various embodiments.
[0046] Figure 7 is a block diagram illustrating components of an example system that can implement algorithms and perform methods structured for real-time event-driven serverless functions within storage systems, according to various embodiments.
DETAILED DESCRIPTION
[0047] In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that other embodiments may be utilized, and that structural, logical, mechanical, and electrical changes may be made. The following description of example embodiments is, therefore, not to be taken in a limited sense.
[0048] The functions or algorithms described herein may be implemented in software in an embodiment. The software may comprise computer-executable instructions stored on computer-readable media or a computer-readable storage device, such as one or more non-transitory memories or other types of hardware-based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, application-specific integrated circuit (ASIC), a microprocessor, or other type of processor operating on a computer system, such as a personal computer, server, or other computer system, turning such computer system into a specifically programmed machine.
[0049] Computer-readable non-transitory media include all types of computer-readable media, including magnetic storage media, optical storage media, and solid state storage media, and specifically exclude signals. It should be understood that the software can be installed in and sold with the devices that implement arrangements of compute clusters and storage clusters for artificial intelligence training or other data intense operations as taught herein. Alternatively, the software can be obtained and loaded into such devices, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. The software can be stored on a server for distribution over the Internet, for example.
[0050] A large set of data having applicability with multiple tasks, where each task may only use a portion of the total data, can present a number of issues with the processing of data for the multiple tasks in an architecture structured for a compute system to operate on data in a storage system. For example, for servers of a storage system to operate on data in a storage system, the data is downloaded from the storage system to the servers for the different tasks, which can result in data duplication when there is an overlap of data associated with the different tasks. There can be a waste of network bandwidth and low performance for a large number of bytes transferred between storage and compute layers, for example via a wide area network (WAN). Other issues can include occurrence of a security risk due to exposure of sensitive data. Manual processes of a conventional procedure, for example in big data processing such as in health care operation, can be error prone and have high labor cost. Replication of generic functions in many applications can result in client-side infrastructure cost.
[0051] In various embodiments, an architecture and protocol can be implemented to allow a user, via a user device, to define UDFs in a storage side of the architecture, and to provide a fully automated real-time event-driven data pipeline. This can provide efficient data processing, extra security, lower cost, and worry-free UDF invocation. The storage side architecture can be implemented in communication with a compute side architecture. The storage side of the architecture can include one or more storage systems. A storage system is a system primarily dedicated to storing data and performing operations to create, read, update, and delete data along with other data-oriented operations and associated storage services that can replicate data, make extra copies of data, and take snapshots of the data. Further, each storage system can have a set of storage nodes, where a storage node can operate as a storage system. A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. Typically, the elements of a pipeline can be executed in parallel or in a time-sliced fashion.
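The pipeline notion defined above, in which the output of one element is the input of the next, can be illustrated with chained Python generators. This is a generic sketch of the concept only; the stage names `normalize` and `drop_empty` and the helper `run_pipeline` are hypothetical and do not come from the disclosure.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[Iterable[str]], Iterator[str]]


def normalize(records: Iterable[str]) -> Iterator[str]:
    """First element: trim whitespace and lower-case each record."""
    for record in records:
        yield record.strip().lower()


def drop_empty(records: Iterable[str]) -> Iterator[str]:
    """Second element: discard records that are empty after normalization."""
    for record in records:
        if record:
            yield record


def run_pipeline(stages: List[Stage], items: Iterable[str]) -> List[str]:
    """Connect the stages in series: each stage consumes the previous stage's output."""
    stream: Iterable[str] = iter(items)
    connected = reduce(lambda upstream, stage: stage(upstream), stages, stream)
    return list(connected)
```

Because generators are lazy, each record passes through every stage before the next record is pulled, so the stages run interleaved rather than one after another over the whole data set.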
[0052] UDFs are functions provided by a user that can be built into a program or environment, and run in the program or environment. A UDF is most commonly used in the analytics field or the database field, where the UDF can extend the native functionality of the query engine for the analytics field or the database field. UDFs are commonly used in the fields of big data analytics (BDA) and artificial intelligence (AI), where the native functionality of the compute platforms is not enough to cover the complex logic in BDA and AI computation. However, there are major gaps in current UDF implementation. UDFs can be characterized by poor performance. For example, a small number of UDFs in a conventional analytics engine for big data processing can take up to 70% of the central processing unit (CPU) time for a number of reasons. UDFs often run as a black-box on the compute side and are difficult to implement in an optimized manner in a query engine such as a structured query language (SQL) engine. UDFs often cannot be pushed down to the data source, so a huge amount of data is to be shipped from the storage system to the compute system, where the compute system cannot perform near data processing (NDP). Although UDFs are currently most used in the context of a SQL query, UDF scope in a query is typically too narrow.
[0053] In various embodiments, an architecture can be implemented in which storage-side UDFs can be pushed down to a storage layer and operate on storage objects in-place directly in the storage side. The storage-side UDFs defined by a user are NDP procedures whose computations in the storage side can be more efficient than enterprise storage system processing and cloud storage backend processing. The storage-side UDFs can be provided for serverless computing, which is a method of providing backend services on an as-used basis. Servers are still used in serverless computing, where serverless computing is charged based on usage, not on a fixed amount of bandwidth or number of servers. Current conventional serverless computing only exists on the compute-side of a compute-side - storage-side architecture.
[0054] The storage-side UDFs can be implemented as orchestrated serverless functions on a storage system. The architecture of storage-side UDFs can be integrated into storage systems or cloud storage backends having a number of characteristics. The storage systems for integration can be generic storage systems, in which UDFs on any format of (big data) object types like comma-separated values (CSV), JavaScript object notation (JSON), columnar, images, streaming, and other object types are supported. The UDFs in the storage-side architecture can use standard storage protocols. The storage systems can run in their own private storage network, which may or may not have an internet connection for security and performance, where some common practices for compute-side serverless operation are modified.
[0055] The storage-side UDFs can be implemented to run directly on storage controllers. The storage controllers can include one or more storage processors and associated memory having stored instructions to manage and run the UDFs. These storage controllers can be assigned many other tasks, especially main tasks in storage operations, and can support multi-level service-level agreements (SLAs). An SLA defines the level of service expected by a customer from a supplier, laying out the metrics by which that service is measured. The SLA can include agreed-upon remedies or penalties for service levels not achieved. These storage controllers can control different sets of storage objects, including orchestration of data locality, and can have different resources and hardware composition with orchestration of resource and hardware affinity.
[0056] Storage systems normally have significant internal metadata information from caching, partition, cataloging/indexing, data protection etc., which is only available to storage systems. Storage-side UDFs and their associated services, as taught herein, run on storage systems directly and can use normally generated information for a node selection algorithm to improve performance of the storage-side UDFs in storage systems having multiple nodes. A storage-side control module, comprising instructions executable by storage processors, can use information that the storage system, with which the storage-side control module is correlated, obtains from internally breaking down objects or obtains from different partitions or shardings. An example of internally breaking down objects can include error correction (EC). Partitioning deals with grouping subsets of data within a single database. Sharding deals with optimizing database management systems by separating rows or columns of a database table into multiple smaller tables. Sharding and partitioning both relate to dispersing a large data set into smaller subsets. Typically, sharding implies that the data is spread across multiple computers, while partitioning typically deals with a single database instance. The storage-side control module can behave like any storage client but without running UDFs on the compute-side. Better performance can be achieved by the storage-side control module controlling the running of the UDFs on the storage-side, where data is closer and safer to access from a security standpoint.
[0057] Figure 1 is an illustration of an embodiment of an example architecture 100 for a storage-side UDF for which a storage system 105 interfaces with a compute system 102. The compute system 102 can include a number of instrumentalities that execute various computation operations. The compute system 102 can include a command line interface (CLI) or a graphical user interface (GUI) 103, a compute module 104, a custom computation module 106, or other structures that provide input to a software development kit (SDK) 107.
An SDK functions to provide a set of tools, libraries, relevant documentation, code samples, processes, and/or guides that allow developers to create software applications on a specific platform.
[0058] In conventional compute-storage architectures, the compute system of the compute-storage architecture performs computations and stores data in the storage system of the compute-storage architecture. When further operation on data in the storage system of the conventional compute-storage architecture is requested, the data is transferred from the storage system of the conventional compute-storage architecture back to the compute system of the conventional compute-storage architecture for computation, where results are stored in another location in the storage system of the conventional compute-storage architecture.
[0059] In architecture 100, the storage system 105 can include a number of storage servers 110-1, 110-2, . . . 110-N. Each of the storage servers 110-1, 110-2, . . . 110-N can be structured with similar or identical types of instrumentality arranged in a similar or identical manner. Each of the storage servers 110-1,
110-2, . . . 110-N can include similar modules such as, but not limited to, a native server 112. The native server 112 can include one or more storage processors and stored instructions to control operations on data to store, delete, access, copy, and perform other conventional operations related to storage of data. The native server 112 can operate in conjunction with compute system 102 with respect to the conventional operations related to storage of data from the compute system 102. For example, the SDK 107 of the compute system 102 can operationally communicate with the native server 112 to perform conventional data operations. The native server 112 can include other event target cluster system 126 that receives storage event information from the native server 112, and can send notification to the event listener service 119.
[0060] Each of the storage servers 110-1, 110-2, . . . 110-N can include an event service 113 to monitor data storage activity of the storage servers with respect to the physical volumes of storage and to interface with a storage-side UDF controller system. The event service 113 can identify occurrence of a data storage event. A data storage event is an operation in a storage system related directly to data. Examples of a data storage event can include uploading data, downloading data, copying data, deleting data, getting data, listing data, opening
a file, and similar data-oriented operations. A data storage event does not include error-based operations dealing with data storage. The event service 113 can monitor command and address lines to the physical volumes of storage to determine action taken on data. The event service 113 of storage server 110-1 can interface with a storage-side UDF controller system that is implemented as a NDP server 115-1. The NDP server 115-1 can be integrated within storage server 110-1 as a definable set of instructions to operate with respect to one or more storage-side UDFs using one or more storage processors of storage server 110-1. NDP server 115-1 can be integrated within storage server 110-1 as designated hardware including one or more storage processors to execute instructions of the NDP server 115-1 to operate with respect to one or more storage-side UDFs. Alternatively, the NDP server 115-1 can be implemented as a standalone structure coupled to storage server 110-1, where such a standalone structure includes a set of instructions to operate with respect to one or more storage-side UDFs using one or more storage processors of the NDP server 115-1.
[0061] Each of the storage servers 110-1, 110-2, . . . 110-N can be arranged with respect to each of NDP servers 115-1, 115-2, . . . 115-N, respectively, in a manner similar to the NDP server 115-1 arranged with the storage server 110-1. Each of the NDP servers 115-1, 115-2, . . . 115-N can be structured in a similar or identical manner. The NDP server 115-1 can include a NDP service 114, a UDF service 116, a function-as-a-service (FaaS) service 117, a UDF registry service 118, and an event listener service 119.
[0062] The NDP service 114 can include, but is not limited to, a storage client base service 135 to control and manage one or many different storage services that behave as storage clients that communicate with underlying storage servers in the storage systems. Typical services can be an object storage service 131, a cloud storage service 132, a file storage service 133, and other storage services 134 such as block storage services, special HDFS (Hadoop Distributed File System) service etc. These services can be implemented as a scalable, high speed, cloud native storage service.
[0063] The NDP Service 114 can be implemented as a storage-side representational state transfer (REST) service that can accept and process common storage requests by operating in compliance with standard protocols.
REST typically defines a set of constraints for the manner in which an architecture, such as an Internet-scale distributed hypermedia system, behaves. REST can provide for scalability of interactions between components, uniform interfaces, independent deployment of components, and a layered architecture to facilitate caching components to reduce user-perceived latency, enforce security, and encapsulate older systems. The storage-side NDP Service 114 can handle more than storage requests; it can include a protocol to process a UDF request as part of storage requests for direct invocation of UDFs. The direct invocation can be automatic initiation of the UDFs. The NDP Service 114 can be implemented with a storage system plugin design to allow UDF support to be highly portable in any storage system and cloud storage backend such that the NDP Service 114 can support more than a specific storage system. The NDP Service 114 can operate in conjunction with the compute system 102, a client 101 separate from the storage system 105 and separate from compute system 102, and other components of the NDP server 115-1.
[0064] The UDF service 116 can manage and execute UDF operations on storage system 105 without uploading data to compute system 102 for computations. The UDF Service 116 can be implemented as a storage-side REST service to validate and invoke UDFs with the option of using serverless or standalone containers. The UDF Service 116 can be initiated from input from the event listener service 119. Results of the invocation of the UDF Service 116 can be provided back to the NDP Service 114. A UDF function in the storage system 105 of the architecture 100 can be a complete serverless UDF that compiles, publishes, and deploys the UDF as a serverless function that combines user-defined function and common boilerplate code. The UDF function can read from and write to storage directly via a storage client within the storage system 105. The UDF service 116 can operate in conjunction with the FaaS service 117 and can send storage locality and resource information to the FaaS service 117 to optimize running of functions.
[0065] The FaaS service 117 can be implemented as a serverless client and a service to manage serverless functions. FaaS can be implemented as a serverless platform to allow users to execute code on storage system 105 as a network edge. With FaaS, users such as developers can build a modular architecture, creating a base for code that is more scalable without having to implement
resources or maintain an underlying backend.
[0066] The UDF registry service 118 can be implemented as a storage-side REST service that provides REST application programming interfaces (APIs) to manage UDF registration in the storage-side. An API is a shared boundary for which two or more separate components of a system exchange information, where the information defines interactions between the two or more separate components. An API can include multiple software applications or mixed hardware-software intermediaries. The UDF registry service 118 can be implemented based on an event target cluster 120 and underlying storage of the event target cluster 120. The UDF registry service 118 can be implemented in conjunction with a UDF hosted hub repository service provided for finding and sharing container images. Such a UDF repository can be implemented as a UDF registry 122 of the event target cluster 120. The UDF registry service 118 provides a mechanism to associate a UDF with a storage-side event and can operate in conjunction with the event listener service 119.
[0067] The event listener service 119 can identify an occurrence of a data storage event in conjunction with the event service 113 of the storage server 110-1, which in conjunction with operation of the UDF registry service 118 can initiate operation of the associated UDF by the UDF service. The event listener service 119 can be implemented as a storage-side REST service that listens to registered streaming sources. The event listener service 119 can react to a storage-side event and automatically invoke UDFs based on the storage information obtained upon certain storage actions and events. The event listener service 119 can extract metadata from monitoring a data storage event.
Extracting metadata can be implemented by reading the data stream or command structure of the data storage event. The event listener service 119 can determine if the metadata of the data storage event matches parameters for invoking and running a storage-side UDF in the storage system. If the metadata of the data storage event does not match the parameters, invocation of the storage-side UDF is not initiated. In contrast, conventional invoking of a compute-side UDF is performed manually without storage awareness.
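The matching step described above can be illustrated with a minimal Python sketch. It is not taken from the described embodiments: the `TriggerParams` class, the `matching_udfs` function, the field names of the event dictionary, and the use of shell-style name patterns are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class TriggerParams:
    """Hypothetical registered parameters for one storage-side UDF."""
    udf_name: str
    bucket: str
    event_types: frozenset
    name_pattern: str = "*"  # assumed shell-style pattern; "*" matches any name


def matching_udfs(event, registered):
    """Return the names of UDFs whose parameters match the event metadata."""
    matches = []
    for params in registered:
        if (event["bucket"] == params.bucket
                and event["event_type"] in params.event_types
                and fnmatch(event["object_name"], params.name_pattern)):
            matches.append(params.udf_name)
    return matches


registered = [
    TriggerParams("thumbnail", "images", frozenset({"put", "copy"}), "*.png"),
    TriggerParams("audit", "logs", frozenset({"delete"})),
]

# Metadata as it might be extracted from a data storage event (assumed shape).
event = {"bucket": "images", "event_type": "put", "object_name": "cat.png"}
print(matching_udfs(event, registered))
```

When no registered parameters match, the function returns an empty list, which corresponds to the case in which invocation of the storage-side UDF is not initiated.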
[0068] The NDP server 115-1 can operate with an event target cluster 120 to control and manage storage-side UDFs. The event target cluster 120 can operate as a system for queueing tasks of the NDP server 115-1. The event target cluster
120 can be implemented as an in-memory data structure store, which can be used as a database, a cache, and a message broker. The event target cluster 120 can include the UDF registry 122, which provides a repository for UDFs, and an event queue 124. The UDF registry 122 can interface with the FaaS service 117 of the NDP server 115-1 to schedule UDFs for operation based on the event queue 124. The event queue 124 can schedule operations associated with identified data events determined in communication with the event listener service 119 of the NDP server 115-1. Similar to arrangement and functionality of storage server 110-1, each of the storage servers 110-1, 110-2, . . . 110-N can be structured with similar or identical types of instrumentality arranged in a similar or identical manner and can operate in a similar or identical manner.
[0069] The architecture 100 for a storage-side UDF can be implemented as a software architecture, using one or more processors, that provides an automatic event-driven storage-side serverless UDF for data pipeline. The architecture 100 and associated protocols allow users to write their own storage functions and to register different types of events to storage systems based on storage operations, time, and alerts. The user can define the parameters for the triggering events such as the storage operations, time, and alerts. The architecture 100 and associated operating parameters include event notification and auto serverless function deployment to support a fully automated data pipeline.
[0070] The serverless deployment can be storage aware via orchestration to perform a selection process directed to identification of optimal storage nodes of the storage system to deploy the serverless UDFs based on real-time resource and hardware availability, such that a user can avoid running the UDF on an information technology (IT) infrastructure that would lead to complex infrastructure management. The architecture 100 for a storage-side UDF implemented as a software architecture can provide a highly portable storage-side UDF framework, which can be a storage-system plugin architecture and associated protocols that can be easily integrated and deployed into any storage system and cloud storage backend. This software architecture can allow for operating the UDF in systems that previously could not support the operation.
[0071] Example architecture 100 can be implemented with one or more compute clusters and/or one or more storage clusters. A compute cluster can
include multiple compute systems, where the compute systems can be structured similar or identical to compute system 102. A storage cluster can include multiple storage systems, where the storage systems can be structured similar or identical to storage system 105. In a data center, these compute systems and storage systems can be connected via a network, and can work together to execute AI operations, analytics operations, or other complex operations.
[0072] Figure 2 illustrates features of an embodiment of an example event-driven UDF invocation sequence flow in the architecture 100 of Figure 1. Operations of the sequence flow can be performed by one or more storage processors executing stored instructions for real-time event-driven serverless functions within the storage system 105. The sequence flow can be performed in a storage system having multiple storage servers operating as multiple storage nodes. For ease of discussion, the sequence flow is described with reference to NDP server 115-1 and storage server 110-1, though a number of storage-side NDPs and storage servers can be used. At operation 1, a registration event is conducted. The client 101 registers notification events to the storage system 105 regarding data of interest categorized according to storage groups, which storage groups can be arranged in storage instruments of the storage system 105 as labelled buckets or folders. This activity provides a registration of the bucket. The client 101 can be a user device directly controlled by a user or can be a system such as but not limited to an Internet of Things (IoT) device. IoT is a network of physical devices, vehicles, home appliances and other items embedded with electronics, software, sensors, and actuators, which are enabled to connect and exchange data allowing for direct integration of devices in the physical world into computer-based systems.
[0073] The storage system 105 sets up the event target cluster 120 to listen for the registered events. The client 101 also registers one or more UDFs into the NDP server 115-1. The storage-side UDF protocol can indicate the trigger condition with parameters that can be set via client 101. The parameters can be data to be matched with metadata in a data storage event to initiate a notification procedure to invoke running a UDF in the storage system after completion of the data storage event. Such registered parameters can include file type, file name, data type, or other parameter to identify a UDF to be invoked. A UDF registration process can be undertaken for each client of the storage system 105.
The UDF registration process can be conducted for each different set of data associated with the client 101.
[0074] At operation 2, the event listener service 119 sets up a notification task with the event target cluster 120 to listen for the registered event notifications. The event listener service 119 includes logic to identify storage action types and storage object information.
[0075] At operation 3, once registration is complete and the storage system 105 of architecture 100 is ready for operation as a storage-side UDF system for client 101, client 101 can upload data to the storage system 105. The data can be, but is not limited to, image data. The client 101 uploads files to the registered bucket and can perform storage operations such as upload, download, copy, delete, get, and list. One or more of these operations can be an event to trigger one or more notifications. A notification can be based on occurrence of multiple storage operations. The protocol for the storage-side UDF system can have other trigger events.
[0076] At operation 4, the storage system 105 finishes the storage operations requested by the client 101 in storage server 110-1 and event service 113 of storage server 110-1 sends notification of the event to the event target cluster 120. The notification can send identification of the event in addition to notification of the occurrence of a data storage event. The notification can identify the data bucket and other characteristics of the data storage event. This generation of the notification is an automatic call. At operation 5, the event target cluster 120 calls back to the event listener service 119 regarding the event notification. The event target cluster 120 can also be on the storage side of architecture 100 to improve speed of the notification.
[0077] At operation 6, the event listener service 119 communicates with the UDF registry service 118 to find registered UDFs with matched event type and buckets. The UDF registry service 118 can check with the event target cluster 120 to determine registered UDFs with matched event type and buckets to the data operation that triggered the notification. The event target cluster 120 can access UDF registry 122 of UDFs and the event queue 124 within the event target cluster 120. At operation 7, the event listener service 119 calls the UDF registry service 118 for matched UDFs. The storage-side UDF system realized by the NDP server 115-1 can support storage-side registry and private registry to
improve response time and security.
[0078] At operation 8, the event listener service 119 calls the UDF service 116 for UDF invocations with storage bucket, objects, and UDF information. At operation 9, the UDF service 116 calls storage system 105 to get storage object location and resource allocation, and then sends the storage information to the serverless framework of the FaaS service 117 for it to deploy serverless function into the optimal storage server (node) of the storage servers 110-1 . . . 110-N. The serverless UDF function and storage information can be deployed by the FaaS service 117 as a UDF container.
[0079] At operation 10, the serverless UDF function applies the UDF function computation to the storage object, which generates a result storage object. The result storage object can be stored directly in the storage system 105 without control by the compute system 102. At operation 11, the UDF service 116 calls the NDP Service 114 to upload the result object. The NDP service 114 can support many storage plugins, which allows the storage-side UDF system to be implemented easily with many storage system types. At operation 12, the NDP service 114 calls storage system-specific SDK on the storage-side to upload the result object to a result bucket registered in storage system 105. The NDP service 114 can also upload the result object to the compute system 102, depending on the UDF function, which may be sent to a display coupled to the compute system 102 or remotely over a network.
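The sequence of operations 3 through 12 can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not the described implementation: the `StorageSystem` class, the `run_flow` function, the dictionary-based registry keyed by bucket and event type, and the byte-uppercasing "UDF" are all hypothetical stand-ins, and node selection and container deployment are omitted.

```python
from collections import deque


class StorageSystem:
    """Toy stand-in for storage system 105: bucket -> {object name: bytes}."""

    def __init__(self):
        self.buckets = {}

    def put(self, bucket, name, data):
        self.buckets.setdefault(bucket, {})[name] = data

    def get(self, bucket, name):
        return self.buckets[bucket][name]


def run_flow(storage, event_queue, registry, udfs):
    """Drain queued events, run each matched UDF, and store the results."""
    invoked = []
    while event_queue:
        event = event_queue.popleft()                 # operation 5: callback
        key = (event["bucket"], event["event_type"])
        for udf_name, result_bucket in registry.get(key, []):   # operations 6-7
            data = storage.get(event["bucket"], event["object"])
            result = udfs[udf_name](data)             # operation 10: apply UDF
            storage.put(result_bucket, event["object"], result)  # operations 11-12
            invoked.append(udf_name)                  # operations 8-9 collapsed
    return invoked


storage = StorageSystem()
event_queue = deque()
# Hypothetical registry: (bucket, event type) -> [(UDF name, result bucket)].
registry = {("images", "put"): [("upper", "results")]}
udfs = {"upper": lambda data: data.upper()}

storage.put("images", "a.txt", b"hello")              # operation 3: client upload
event_queue.append(                                   # operation 4: notification
    {"bucket": "images", "event_type": "put", "object": "a.txt"})
invoked = run_flow(storage, event_queue, registry, udfs)
```

The result object is written to a bucket inside the same storage object, so no data leaves the toy storage system during the flow, which mirrors the near data processing intent of the sequence.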
[0080] As an example in which the data of the data storage event is image data, the example event-driven UDF invocation sequence flow of Figure 2 can be applied to automatically generate a thumbnail in response to storage of an image. Upon upload of a large image into an event-registered object bucket, the storage system, using its integrated storage-side UDF system, can auto-generate a thumbnail of the large image via a registered UDF within the storage system. The thumbnail, as part of the storage-side UDF procedure, can be stored to another object bucket of the storage system. User access to the thumbnail via a request by a compute system, typically at a later time, can be accomplished by accessing the object bucket.
[0081] Figure 3 illustrates an embodiment of an example automatic data pipeline for providing access to health data by different users using a storage-side UDF system similar to the architecture of Figure 2. The use of a storage-side serverless UDF can provide for real-time, event-driven, scalable, automated, and near data processing, which can provide a data pipeline. A patient’s X-ray provides raw data that can be used by both a clinic and a researcher, though each may only use a portion of the raw data. At operation 1 of Figure 3, a new X-ray of a patient is uploaded into a raw X-ray data bucket 311-1 of a storage instrument 310 in a storage system 302. At operation 2 of Figure 3, a bucket notification service assigned to raw X-ray data bucket 311-1 fires an event notification to an event target cluster 320. The event notification can be a notification of object creation. At operation 3, identification of the new event is received in a storage-side UDF controller 330. The storage-side UDF controller 330 can be implemented as a NDP server. The identification of the new event can be received in an event listener service 319 of the storage-side UDF controller 330 from the event target cluster 320.
[0082] At operation 4 of Figure 3, after receiving the new event notification, the event listener service 319 obtains UDFs that match the event by looking up the UDFs that are registered to the detected new event. The event listener service 319 can invoke the matched UDFs via a UDF service 316. The UDF service 316 calls a serverless platform in which a deployment coordinator 345 automatically deploys a serverless function container 350, using UDF registry 318. The serverless function container 350 can be generated as a pod. A pod is the smallest deployable computing unit in a container scheduling and orchestration environment. It is a collection of one or more containers that operate together, where the containers within a pod share common networking and storage resources from a host node, as well as specifications that determine how the containers run. The automatic deployment can be generated after automatic execution of a node selection algorithm to determine the storage node of the storage system 302 on which to execute the matched UDFs.
[0083] At operation 5 of Figure 3, the serverless function container 350 retrieves the image loaded into the raw X-ray data bucket 311-1. The serverless function container 350 operates on the retrieved data using the UDFs that matched the detected event. For example, for the uploaded X-ray image, the UDFs that matched the detected event can be a function to detect pneumonia from the X-ray, which can be used by the clinic, and a function to de-identify the uploaded X-ray image from the patient, which can be used by the researcher.
[0084] At operation 6 of Figure 3, the respective results from executing the UDF to detect pneumonia are automatically saved in an enriched X-ray data bucket 311-2, which can be accessed by the clinic. These saved results can be one or more processed images after pneumonia risk detection. At operation 7 of Figure 3, the respective results from executing the UDF to de-identify the uploaded X-ray image, which results can be one or more de-identified images, are automatically saved in a de-identified X-ray data bucket 311-3, which can be accessed by the researchers. With the NDP storage-side UDF support running, the entire process becomes a data pipeline that is secure, efficient, and fully automated.
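The fan-out of one upload event to two UDFs with separate result buckets can be illustrated by the following Python sketch. Everything in it is an assumption for illustration: the bucket names, the record layout, the `IDENTIFYING_FIELDS` set, and the `enrich` function, which is only a placeholder that tags a record and performs no actual pneumonia detection.

```python
# Assumed names of identifying fields in a toy X-ray record.
IDENTIFYING_FIELDS = {"patient_name", "patient_id"}


def de_identify(record):
    """Return a copy of the record without the identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def enrich(record):
    """Placeholder for a detection UDF: tags the record, runs no model."""
    out = dict(record)
    out["analyzed"] = True
    return out


# Hypothetical registration: source bucket -> [(UDF, result bucket)].
PIPELINE = {
    "raw_xray": [(enrich, "enriched_xray"), (de_identify, "deidentified_xray")],
}


def on_object_created(buckets, bucket, name):
    """Run every UDF registered to the bucket and save each result."""
    record = buckets[bucket][name]
    for udf, target in PIPELINE.get(bucket, []):
        buckets.setdefault(target, {})[name] = udf(record)


buckets = {
    "raw_xray": {
        "scan1": {"patient_name": "A", "patient_id": 7, "pixels": [0, 1]},
    },
}
on_object_created(buckets, "raw_xray", "scan1")
```

In this sketch the raw record is left untouched, while each consumer reads only from the bucket holding the result intended for it.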
[0085] The event registration mechanism can be implemented with a user filling in one simple configuration file with the storage-side UDF system automatically setting up the invocation process of UDFs in response to detecting presence of real-time events. The following is an example of a code snippet for a simple configuration file:
provider:
  name: openfaas
  gateway: http://127.0.0.1:8080
functions:
  storage-side_udf_controller-faas-spring-thumbnail:
    lang: springboot
    handler: ./storage-side_udf_controller-faas-spring-thumbnail
    image: ywang529/storage-side_udf_controller-faas-spring-thumbnail:latest
    annotations:
      # The following are optional
      # auto invoke this UDF whenever receive PUT and COPY storage events on "images" bucket
      invocation_event_types: "[{ "bucket": "images", "event_type":["put", "copy"]}]"
      # This UDF prefers running on GPU, but CPU is also okay if there is no GPU available, and the weight 80 (range 1-100) represents the level of priority
      invocation_special_resource: "[{"nvidia.com/gpu", "limit": 1, "type":"nvidia-tesla-p100", "affinity":"preferred", "weight":80}]"
      # This UDF prefers running on the storage servers that controls this bucket (data is closer to these servers, ignore this if storage servers are equal).
      # The priority of GPU (80) is higher than the ndp (50)
      invocation_ndp: "[{ "affinity":"preferred", "weight":50}]"
[0086] In the above code snippet, registration of a UDF is conducted with the parameters that a put or copy event on a bucket labelled images triggers an invocation of a UDF. Also specified are parameters for operation of the UDF. The specified parameters include the preferred type of processor to use and the storage server of the storage system to use in executing the invoked UDF. In this example, the user prefers running on a graphics processing unit (GPU), but a CPU is also okay if there is no GPU available. The GPU hardware characteristic is associated with a weight of 80 in the range 1-100, which represents a level of priority in selecting storage-side instrumentalities in the storage system for running the UDF. The input parameters also include the preference of running on the storage server that controls this image, but this preference can be ignored in the selection process if storage servers are of equal distance to the data. The priority of 80 for using a GPU is higher than the priority of 50 for using a NDP.
[0087] There are a number of other examples of event-driven invocation of UDF for data processing based on bucket notification plus serverless operation. Event-driven invocation of a UDF can include, but is not limited to, invocation upon object download; invocation upon object upload; invocation upon object copy; invocation upon object access/get; invocation upon time; invocation upon storage alerts. Events to optional storage information can also be configured. These data storage events can be identified in the storage system from extracting metadata from the data stream to the storage system for the data storage event. UDFs, which can be triggered by these events, are stored and registered in the storage system rather than a compute system. The UDFs can be written in any programming language, any language version, and any build system.
Examples of such programming languages include, but are not limited to Java, Scala, and Python. Examples of such build systems include, but are not limited to, Maven and Gradle.
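As an illustration of how the invocation_event_types annotation of the configuration file might be consumed, the following Python sketch parses such a value into a set of trigger pairs. It assumes the annotation reaches the storage-side UDF system as a string containing valid JSON, which the source does not state; the function name `parse_event_types` is hypothetical.

```python
import json

# Assumed form of the annotation value once delivered as a JSON string.
annotation = '[{"bucket": "images", "event_type": ["put", "copy"]}]'


def parse_event_types(annotation):
    """Return the set of (bucket, event type) pairs that trigger the UDF."""
    pairs = set()
    for entry in json.loads(annotation):
        for event_type in entry["event_type"]:
            pairs.add((entry["bucket"], event_type))
    return pairs


triggers = parse_event_types(annotation)
print(("images", "put") in triggers)
```

A set of pairs keeps the lookup for an incoming event to a single membership test, which is one reasonable choice for a listener that checks every storage event.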
[0088] All the UDF data, including images and metadata, can be stored in a hosted repository service, in a storage-side UDF system, for finding and sharing container images and for automatic builds of container images. The UDF data can be stored in a conventional system, such as a Docker Hub, with proper user credentials. When using a Docker Hub, the metadata information such as UDF invocation condition can be stored in an “annotations” field of each UDF in the Docker Hub. More metadata fields can be added as needed to support more complex applications of a storage-side UDF system.
[0089] In various embodiments, a storage-side UDF system supports direct invocation of the storage-side serverless UDF. In addition, standard storage protocols can be supported with UDF additions, for example, Amazon Simple Storage Service (Amz S3) protocol. Amz S3 is a storage instrumentality for the Internet that has a simple web service interface to store and retrieve data at any time from anywhere on the Internet. Consider the following addition of a storage-side UDF to generate thumbnails for an Amz S3 protocol.
Request Syntax
PUT /Key+ HTTP/1.1
Host: Bucket.s3.amazonaws.com
x-amz-acl: ACL
Cache-Control: CacheControl
Content-Disposition: ContentDisposition
Content-Encoding: ContentEncoding
Content-Language: ContentLanguage
Content-Length: ContentLength
Content-MD5: ContentMD5
Content-Type: ContentType
Expires: Expires
x-amz-grant-full-control: GrantFullControl
x-amz-grant-read: GrantRead
x-amz-grant-read-acp: GrantReadACP
x-amz-grant-write-acp: GrantWriteACP
x-amz-server-side-encryption: ServerSideEncryption
x-amz-storage-class: StorageClass
x-amz-website-redirect-location: WebsiteRedirectLocation
x-amz-server-side-encryption-customer-algorithm: SSECust
x-amz-server-side-encryption-customer-key: SSECustomerKey
x-amz-server-side-encryption-customer-key-MD5: SSECustomerKeyMD5
x-amz-server-side-encryption-aws-kms-key-id: SSEKMSKeyId
x-amz-server-side-encryption-context: SSEKMSEncryptionContext
x-amz-server-side-encryption-bucket-key-enabled: BucketKeyEnabled
x-amz-request-payer: RequestPayer
x-amz-tagging: Tagging
x-amz-object-lock-mode: ObjectLockMode
x-amz-object-lock-retain-until-date: ObjectLockRetainUntilDate
x-amz-object-lock-legal-hold: ObjectLockLegalHoldStatus
x-amz-expected-bucket-owner: ExpectedBucketOwner
x-amz-meta-storage-side_udf_controller_udf_name: "storage-side_udf_controller-faas-spring-thumbnail"
x-amz-meta-storage-side_udf_controller_input_parameters: [400, 600] #size of the thumbnail
x-amz-meta-storage-side_udf_controller_target_bucket: "thumbnails_bucket"

Body
[0090] In the above code snippet, the instructions for executing a storage-side UDF for generating thumbnails and placing the generated thumbnails in a thumbnails bucket are tacked on to the end of a section of the Amz S3 protocol. Storage systems in which a storage-side UDF controller is integrated can be provided with the option of ignoring these tacked-on sections to operate without the use of the storage-side UDF controller.
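As a minimal sketch of how a storage-side UDF controller might read the tacked-on metadata headers of the request above, the following Python function separates the UDF directives from the standard headers. The function name extract_udf_directives and the layout of the returned dictionary are illustrative assumptions; only the header names and example values are taken from the snippet.

```python
import json

# Prefix shared by the UDF-related metadata headers in the request above.
UDF_PREFIX = "x-amz-meta-storage-side_udf_controller_"


def extract_udf_directives(headers):
    """Return the UDF directives found in a header mapping, or None.

    Hypothetical helper: a storage system that ignores the tacked-on
    section simply never calls it and handles the PUT as usual.
    """
    directives = {}
    for name, value in headers.items():
        key = name.lower()
        if key.startswith(UDF_PREFIX):
            # Drop the prefix and any surrounding double quotes.
            directives[key[len(UDF_PREFIX):]] = value.strip().strip('"')
    if "udf_name" not in directives:
        return None
    if "input_parameters" in directives:
        # The example passes the thumbnail size as a JSON-style list.
        directives["input_parameters"] = json.loads(directives["input_parameters"])
    return directives


headers = {
    "Content-Type": "image/png",
    "x-amz-meta-storage-side_udf_controller_udf_name":
        '"storage-side_udf_controller-faas-spring-thumbnail"',
    "x-amz-meta-storage-side_udf_controller_input_parameters": "[400, 600]",
    "x-amz-meta-storage-side_udf_controller_target_bucket": '"thumbnails_bucket"',
}
directives = extract_udf_directives(headers)
```

A request without the UDF headers yields None, which corresponds to operating without the storage-side UDF controller.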
[0091] A storage-side UDF system, as taught herein, can include one or more different and varied features. Such a storage-side UDF system can be implemented with a user device being able to issue a storage request, while a UDF operation is being performed. For example, a user device can upload an image file to a storage bucket folder, while a request can be issued to run a storage-side UDF that can create a thumbnail from the newly uploaded image
and store the thumbnail into a separate storage bucket. In addition, UDF information can be embedded in the metadata section for the thumbnail.
[0092] Since a storage system can include multiple storage nodes, a storage-side UDF controller can be implemented to determine into which storage node of the multiple storage nodes to deploy the serverless UDF. In making such a selection, a number of items can be taken into account. A user device can define metadata for the UDF. For example, the user-defined metadata can include a parameter to indicate whether the given UDF, which is registered, uses special hardware like a GPU to run. Further, the UDF service of the storage-side UDF system can analyze the following: UDF metadata information, such as, but not limited to, whether a GPU is preferred to run the UDF; event information, such as, but not limited to, the identity of one or more storage buckets and objects mapped to these storage buckets; storage system information such as, but not limited to, which storage node controls or is close to the storage bucket containing the data and whether this storage node has the proper instrumentality such as a GPU. From the automated analysis, the UDF service can send, to an orchestrator, the identified storage node, as the selected most appropriate storage node to run the UDF, and the corresponding storage information. The orchestrator can then deploy the UDFs to the selected storage node based on the selection information. The orchestrator can be implemented as a set of instructions, which can include operation as a serverless platform that can offer database and storage services.
[0093] In various embodiments, internal storage information can be used to influence the serverless UDF’s scheduling process. Storage awareness features related to automating deployment, scaling, and management of containerized applications can be implemented via a conventional advanced scheduling service by setting configuration and calling a runtime scheduler extender for serverless functions, for example running as pods, to deploy and run a matched storage-side UDF on the most desired storage node of the storage system.
[0094] Figure 4 shows an example of a sequence of a UDF deployment with respect to node selection. The sequence can be performed in a storage system in an architecture similar or identical to architecture 100 of Figure 1, where the storage system includes multiple storage nodes 405-1 . . . 405-N. Storage node 405-1 can include a storage 410-1 having storage services 412-1 and a storage-side UDF controller 430-1. The storage-side UDF controller 430-1 can be implemented as an NDP server. The storage-side UDF controller 430-1 can include an NDP service 414-1, a UDF service 416-1, and an event listener service 419-1. Similarly, the storage node 405-N can include a storage 410-N having storage services 412-N and a storage-side UDF controller 430-N. The storage-side UDF controller 430-N can be implemented as an NDP server. The storage-side UDF controller 430-N can include an NDP service 414-N, a UDF service 416-N, and an event listener service 419-N. A UDF orchestrator 440 and an event target cluster 420 can be implemented on the storage-side of a compute system-storage system architecture.
[0095] At operation 1 of the sequence, the event target system cluster 420 on the storage-side of the storage system calls back the event listener service 419-1 for notification. At operation 2 of the sequence, the event listener service 419-1 calls UDF service 416-1 for UDF invocations with storage bucket, objects, and UDF information. At operation 3 of the sequence, the UDF service 416-1 calls UDF registry 422 to find the matched UDFs and their metadata information.
The UDF registry 422 can also function as an event target cluster or as a portion of an event target cluster similar to UDF registry 122 of the event target cluster 120 of Figure 1. If a match is found, then, at operation 4 of the sequence, the UDF service 416-1 calls NDP service 414-1 for storage information.
[0096] At operation 5 of the sequence, the NDP service 414-1 calls the storage system to get storage object location and resource allocation information via the storage system SDK with the storage bucket, objects, and UDF information from the event listener service 419-1. At operation 6 of the sequence, the UDF service 416-1 can send the storage information to the orchestrator 440. At operation 7 of the sequence, the orchestrator 440 can deploy serverless functions to the optimal storage node. For example, the orchestrator 440 can deploy serverless function x 450-1 and serverless function y 450-2 to storage node 405-1, while deploying serverless function z 450-3 to storage node 405-N.
[0097] Storage node selection for serverless UDFs can be realized using storage awareness features that can be implemented into container orchestration via a scheduling service by setting configuration files and calling a runtime scheduler for running serverless UDFs as pods. Container orchestration can involve automation of all aspects of coordinating and managing containers and
can be focused on managing the life cycle of containers and their dynamic environments. There are a number of tasks that can be automated via container orchestration. An orchestrator can configure and schedule containers, provision and deploy containers, and manage availability of containers. An orchestrator can control configuration of applications in terms of the containers in which these applications run and can control scaling of containers to equally balance application workloads across an infrastructure. An orchestrator can manage allocation of resources between containers, manage load balancing, traffic routing and service discovery of containers, manage health monitoring of containers, and manage security of interactions between containers. An orchestrator can be implemented via instructions stored in a memory device and executed by one or more processors.
[0098] The selection of a storage node from among multiple storage nodes of a storage system to execute a matched UDF can include multiple activities. The multiple activities of an example procedure can include a filter process, a scoring process, a select process based on the scoring results of the scoring process, and a binding process. The filter process can include using a filter that can label a node as a pod, and then filter out nodes that are not qualified. The filter can perform pod-node matching that checks conditions of the node and matching of preferences and requirements for the node, while considering preferences and instrumentality not to be included. The pod-node matching can be performed during scheduling rather than during UDF execution. The filter process can include pod matching to consider topology of appropriate nodes (pod affinity) and inappropriate nodes (pod anti-affinity). The filter process can include pod distribution that checks service affinity, which allows users to define an affinity or anti-affinity relationship of a service to another service. The filter process can include a hardware filter to identify whether a node has special hardware, for example, but not limited to, a GPU.
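The filter process described above can be illustrated with a short Python sketch. It assumes a simplified node model in which each node records whether it is ready, whether it has a GPU, and which UDFs it already runs; the class names Node and UdfRequirements and the function filter_nodes are hypothetical and are not part of any particular orchestrator.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    has_gpu: bool = False
    ready: bool = True
    running_udfs: set = field(default_factory=set)


@dataclass
class UdfRequirements:
    requires_gpu: bool = False
    # UDFs that must not share a node with this one (pod anti-affinity).
    anti_affinity: set = field(default_factory=set)


def filter_nodes(nodes, req):
    """Keep only the nodes that satisfy the hard requirements of the UDF."""
    qualified = []
    for node in nodes:
        if not node.ready:
            continue  # node condition check
        if req.requires_gpu and not node.has_gpu:
            continue  # hardware filter
        if node.running_udfs & req.anti_affinity:
            continue  # pod anti-affinity check
        qualified.append(node)
    return qualified


nodes = [
    Node("node-1", has_gpu=True),
    Node("node-2", has_gpu=False),
    Node("node-3", has_gpu=True, running_udfs={"resize"}),
    Node("node-4", has_gpu=True, ready=False),
]
req = UdfRequirements(requires_gpu=True, anti_affinity={"resize"})
qualified = [n.name for n in filter_nodes(nodes, req)]
```

Only the nodes that pass every check go on to the scoring process; preferences, as opposed to requirements, are left to scoring.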
[0099] In some instances, the filter process can include a sampling procedure with respect to the nodes associated with the storage system. For example, a configuration ratio, R, can be set to an amount less than the total number of nodes of the storage system. If a number of nodes of a cluster is N, then the number of nodes to find can be set to the maximum of (N*R, Min), where Min is a default minimum. The default minimum is less than the total number N.
Consider that the configuration ratio R for a large number of nodes can be set to, but is not limited to, 10%. If the number of nodes N of a cluster, in this example, is 3,000, then the number of nodes to find can be set to the maximum of (3000*10/100, 100), where 100 is the default minimum, which yields 300 nodes. Another example of finding a configuration number of nodes is to consider a range relative to a selected number. The number of nodes to find can be set to the maximum of (range - (total number of nodes in the cluster/the selected number)). In a non-limiting example, the number of nodes to find can be set to the maximum of ((5,50)-(the number of nodes in the cluster)/125). In other instances, all nodes may be considered in the filtering and scoring procedures.
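The first sampling rule, the maximum of (N*R, Min), can be sketched in Python as follows. The function name nodes_to_find is hypothetical, the ratio is expressed as an integer percentage to mirror the 3000*10/100 arithmetic of the example, and capping the result at N is an added assumption for clusters smaller than the default minimum.

```python
def nodes_to_find(total_nodes, ratio_percent=10, default_min=100):
    """Number of nodes to sample: the maximum of (N*R, Min).

    The cap at total_nodes is an assumption added here so that a small
    cluster never asks for more nodes than it has.
    """
    sampled = max(total_nodes * ratio_percent // 100, default_min)
    return min(total_nodes, sampled)


example = nodes_to_find(3000)
```

For the 3,000-node cluster of the example, nodes_to_find(3000) returns 300, since 300 exceeds the default minimum of 100.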
[0100] The scoring process can include assigning a weight for node preference in a pod configuration during scheduling. The weight can be a number assigned within a specified range. For example, the score can be a number assigned in the range (1-100) for node preference in the pod configuration. The scheduler, which can be an orchestrator, can compute a sum for a given node by iterating through elements of preference fields for a storage-side UDF and adding a weight to the sum if preferences of the given node match corresponding elements in the iteration. This score is then combined with the scores of other priority UDFs for the node. A node or multiple nodes with the highest total score can be labeled as most preferred.
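A minimal Python sketch of the weight summation described above is given below. It assumes that node properties are exposed as key-value labels and that each preference is a (label key, label value, weight) triple; these representations and the name preference_score are illustrative assumptions.

```python
def preference_score(node_labels, preferences):
    """Sum the weights of the preference terms that the node satisfies.

    preferences is a list of (label_key, label_value, weight) triples,
    with each weight in the range 1-100.
    """
    total = 0
    for key, value, weight in preferences:
        if not 1 <= weight <= 100:
            raise ValueError("weight must be in the range 1-100")
        if node_labels.get(key) == value:
            total += weight
    return total


# A UDF that prefers a GPU (weight 80) and a node in zone "b" (weight 20).
preferences = [("gpu", "true", 80), ("zone", "b", 20)]
score = preference_score({"gpu": "true", "zone": "a"}, preferences)
```

Here the node matches only the GPU preference, so its sum is 80; a node matching both terms would receive 100.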
[0101] The following scenarios are examples of preferences considered in the node selection procedure. An example first scenario can include a given UDF having a parameter set to a preference to run on a GPU and on the closest storage node, where the GPU preference has a higher score weight. An example second scenario can include a given UDF having a parameter set to node affinity, which is a preference to run on a storage node that is the closest node to the data on which the given UDF is to operate. An example third scenario can include a given UDF having a parameter set to a node affinity for requiring a GPU node. An example fourth scenario can include a first UDF and a second UDF having a parameter set to always run the first and second UDF in sequence, which can lead to a preference for the two UDFs to be co-deployed for the same storage node, which is an example of pod affinity. An example fifth scenario can include a first UDF and a second UDF having a parameter set to deploy the first UDF and the second UDF on different nodes based on both the first and
second UDFs being CPU-intensive. Different pods can have different processing instruments. Each of these scenarios can include use of weights in the scoring process.
[0102] The scoring process can include a scheduling algorithm related to node specific scoring for resource usage. Examples of resource usage scoring can include priorities associated with preferential distribution, preferential stacking, fragmentation rate, and node spread. With respect to preferential distribution, an idle resource rate can be considered. The idle resource rate can be generated as a ratio of a difference between allocatable and request to the allocatable. Request indicates the resources that have been allocated to nodes. Allocatable indicates the resources that can be scheduled to nodes. The higher the idle resource rate, the higher is the score.
[0103] With respect to preferential stacking, a resource usage can be considered. The resource usage can be generated as a ratio of request to allocatable. Request indicates the resources that have been allocated to nodes. Allocatable indicates the resources that can be scheduled to nodes. The higher the resource usage, the higher is the score.
[0104] With respect to fragmentation rate, processor and memory resources can be considered. The fragmentation rate can be generated as an absolute value of a difference between CPU usage and memory usage. The CPU usage is taken as a function of a ratio of request to allocatable and the memory usage as a function of the ratio of request to allocatable. Request indicates the resources that have been allocated to nodes. Allocatable indicates the resources that can be scheduled to nodes. The higher the fragmentation rate, the lower is the score.
[0105] With respect to node spread, a number of pods in a container to deploy for the UDFs matched to detected storage events can be considered. The node spread can be generated as a ratio of a difference between the total number T of pods in a container and a statistical value N of nodes to the total number T. The higher the node spread, the higher is the score.
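The four resource-usage quantities described above can be written out as a short Python sketch. The function names are hypothetical, and the CPU and memory usage in the fragmentation rate are taken to be the plain request-to-allocatable ratios, which is one possible reading of "a function of a ratio".

```python
def idle_resource_rate(request, allocatable):
    """Preferential distribution: (allocatable - request) / allocatable."""
    return (allocatable - request) / allocatable


def resource_usage(request, allocatable):
    """Preferential stacking: request / allocatable."""
    return request / allocatable


def fragmentation_rate(cpu_request, cpu_allocatable, mem_request, mem_allocatable):
    """Absolute difference between CPU usage and memory usage."""
    cpu_usage = resource_usage(cpu_request, cpu_allocatable)
    mem_usage = resource_usage(mem_request, mem_allocatable)
    return abs(cpu_usage - mem_usage)


def node_spread(total_pods, node_value):
    """(T - N) / T for total pod count T and statistical node value N."""
    return (total_pods - node_value) / total_pods
```

A scheduler would map each quantity onto its score range in the direction stated in the text: higher idle rate, usage, or spread raises the score for the corresponding priority, while a higher fragmentation rate lowers it.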
[0106] Once a score is generated for each considered node, one or more of the nodes is selected as a storage-side host to execute one or more storage-side UDFs. Once the storage-side host is selected, the information for the UDF execution can be bound to the selected storage-side host as a pod.
[0107] Figure 5 illustrates an example of factors that influence a decision in a
storage-side serverless scheduler 540 to select a storage node for UDF execution. The serverless scheduler 540 can be implemented as an orchestrator in an architecture similar to the architectures of Figures 1 and 4. The serverless scheduler 540 can receive UDF metadata information 562, storage node labels 564, runtime event information 566, and runtime storage system information 568. The UDF metadata information 562 can include UDF configuration data. The storage node labels 564 can include node configurations. The runtime event information 566 can include information on objects, buckets, and storage actions. The runtime storage system information 568 can include bucket and object owner information.
[0108] The information provided to the scheduler 540 can be used in a node score calculation algorithm. A non-limiting example can include a score for each node generated as
Score = (weightage * PriorityFunction1) + (weightage * PriorityFunction2) + ...,
where the priority functions take into consideration the preferences and priorities of the UDFs for the given node. In pod scheduling, the scheduler 540 selects the node with the highest score to schedule the pod. The highest score can be a highest normalized score. If multiple nodes have the same highest score, then the scheduler 540 can pick one randomly. Alternatively, if multiple nodes have the same highest score, then the scheduler 540 can pick more than one randomly.
[0109] In various embodiments, a method and a system for invoking storage-side UDFs are provided that are resilient and orchestrated based on storage events to form a fully-automated data pipeline. Methods can include registration of storage-side UDFs from a user device. An interface of UDF configurations can be introduced to include one or more trigger conditions, user preferences in storage resources, and SLA. For utilization with standard storage protocols, an add-on metadata section can be added to the standard storage protocols in a defined and separable manner.
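Returning to the node score calculation of paragraph [0108], the weighted sum and the random tie-break can be sketched in Python as follows. The names node_score and select_node, the dictionary-based node description, and the two example priority functions are assumptions made for illustration only.

```python
import random


def node_score(node, weighted_priorities):
    """Score = sum of weightage * PriorityFunction(node) over all terms."""
    return sum(weight * fn(node) for weight, fn in weighted_priorities)


def select_node(nodes, weighted_priorities, rng=random):
    """Return the name of a highest-scoring node, ties broken randomly."""
    scores = {name: node_score(info, weighted_priorities)
              for name, info in nodes.items()}
    best = max(scores.values())
    candidates = [name for name, s in scores.items() if s == best]
    return rng.choice(candidates)


# Hypothetical priority functions: GPU availability and data locality.
priorities = [
    (2, lambda n: n["gpu"]),
    (1, lambda n: n["local"]),
]
nodes = {
    "node-1": {"gpu": 1, "local": 0},
    "node-2": {"gpu": 1, "local": 1},
}
chosen = select_node(nodes, priorities)
```

With these weights, node-1 scores 2 and node-2 scores 3, so node-2 is chosen; when several nodes share the top score, the selection among them is random.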
[0110] Such methods for using storage-side UDFs can include storage and retrieval of UDFs from a storage-side UDF registry. A storage-side private UDF registry can be introduced to provide capability for extra security measures, especially in network security. Some embodiments providing enhanced security can include no internet exposure, sensitive data staying in the storage system, avoidance of attacks like man-in-the-middle attacks, avoidance of UDF tampering, and other security features. With a top concern of running storage-side UDFs being security, storage-side execution of UDFs can avoid a common practice of using a cloud-based public registry that might expose security risks. A storage-side UDF registry can also provide better performance because some UDF data, such as images, can be large, and a storage-side UDF registry can reduce external network transportation.
[0111] Such methods for using storage-side UDFs can include the processing and analysis of event notifications, UDFs, and target storage objects. A mechanism can be provided to collect and analyze multi-dimensional metadata information from storage systems, storage objects, events, and UDFs. Since the services for the storage-side maintenance, storage, and execution of UDFs are running on storage systems, internal information of the storage system can include information regarding internal storage caching, internal storage partitioning, sharding, and indexing, and an internal storage data protection scheme. With respect to internal storage caching, if the storage objects used in a UDF are already in cache and validated, accessing disks and drives to obtain them again can be avoided, which can save storage input/output (I/O) and shorten response time. With respect to internal storage partitioning, sharding, and indexing, if storage object partitions don’t contain the information that a UDF uses, these partitions can be skipped entirely, which can save storage I/O and improve performance. With respect to the internal storage data protection scheme, if the storage system has internal replicas from the main data source, execution of the UDFs can be arranged for the UDFs to operate directly on the secondary objects, confirmed as being the same as the primary, which does not impact the main production data source and further improves performance.
[0112] Such methods can include selecting one or more storage nodes for UDF deployment and invocation. The UDF can operate on images or big data. An orchestration extension can be provided to explicitly send internal analyzed storage information to influence the serverless UDF’s scheduling process. Storage awareness features can be implemented via the advanced scheduling service of this feature by setting configuration and calling the runtime scheduler extender for serverless functions, running as pods, to deploy and run UDF on the most desired storage node.
[0113] A system for invoking storage-side UDFs, which is resilient and can be
orchestrated based on storage events to form a fully-automated data pipeline, can be realized by a single storage system or multiple distributed storage systems. Each storage system can include multiple resilient storage nodes of any type of storage systems, where each storage node can contain software package plugins to support storage-side serverless UDFs. Such software package plugins can include instructions, executable by a processor, for receiving registration of storage-side UDFs via REST services.
[0114] Such software package plugins can include instructions, executable by a processor, for serving requests to retrieve runtime information for the storage system including but not restricted to internal and public information of the storage system. Such internal and public information can include, but is not limited to, storage locality, storage system resources, caching, indexing, data protection, service oriented architecture (SOA) policy, and other relevant data associated with the storage system.
[0115] Such software package plugins can include instructions, executable by a processor, for processing requests that invoke storage-side UDFs that operate on storage objects. The instructions can include storing result objects, from the UDF operations on the storage objects, in memory or storage media via storage processors or storage nodes.
[0116] Figure 6 is a flow diagram of features of an embodiment of an example method 600 of storage-side computation. The method 600 can be performed using a processing module in a computer-implemented method, where the processing module has one or more storage processors executing stored instructions. The one or more storage processors can be structured having tasks to store, retrieve, and protect data. At operation 610, a data storage event is detected in a storage system, where the data storage event was initiated from exterior to the storage system. The storage system can be arranged in an architecture of a compute system separated from the storage system. In general operation, the compute system can generate data and interface with a user device to store and retrieve data from the storage system. The storage system can be located locally with the compute system or remotely from the compute system. Communication between the compute system and the storage system can be implemented over local communication and interface instrumentalities or over a communication network. The storage system can have one or more storage
nodes, where each storage node has memory and one or more storage processors. The storage system can also contain data storage equipment in the form of disk drives, where the disk drives are either external to the storage nodes or as part of the storage nodes. Each storage node of the storage system can control a certain portion of data storage, and can manage operations on data including operations to store, delete, access, copy, and perform other data operations. These operations can be performed with respect to storage objects on the disk drives associated with the respective storage node.
[0117] Notification of data storage events in a storage node can be conducted via a storage processor for the storage node, where the storage processor or associated memory or circuitry has logic to monitor these events. For example, the logic can monitor request and command structures received in the storage node along with monitoring metadata of the request and command structures and data of the data storage event. The notification can be made to a control module in the storage system, where the control module includes one or more processors to execute stored instructions to perform as an orchestrator for the storage nodes of the storage system. In some embodiments, the notification can also be conveyed to the compute system of the architecture.
[0118] At operation 620, metadata from the detected data storage event is extracted. Extracting the metadata from the data storage event can include electronically reading requests, commands, or data received in the storage system from exterior to the storage system. The source of the metadata of the data storage event can include one or more of a client user device and one or more compute systems.
[0119] At operation 630, in response to detecting the data storage event in the storage system and after completing the data storage event, a UDF is automatically invoked directly within the storage system, based on the metadata extracted, where the UDF resides within the storage system. Automatically invoking the UDF can include initiating the invocation with notifying services within the storage system. The UDF can automatically run directly within the storage system. A result of operation of the UDF in the storage system can be stored in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event. The result may be provided to a user device
via a compute system associated with the storage system upon determination of the result, depending on the parameters of the UDF, or at some later time in response to a request for the result.
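Operations 610 to 630 can be summarized in a small Python sketch. The in-memory storage dictionary, the event dictionary with a completed flag, and the registry of (trigger, UDF, target bucket) triples are simplifying assumptions chosen for illustration; handle_storage_event is a hypothetical name.

```python
def handle_storage_event(event, registry, storage):
    """Detect an event, extract its metadata, and invoke matching UDFs.

    Results are stored back in the storage system rather than returned
    to the client that initiated the event.
    """
    if not event.get("completed"):
        return []  # UDFs are invoked only after the storage event completes
    # Operation 620: extract metadata from the detected event.
    metadata = {k: event[k] for k in ("bucket", "key", "action")}
    stored = []
    # Operation 630: invoke each UDF whose trigger matches the metadata.
    for trigger, udf, target_bucket in registry:
        if trigger(metadata):
            source = storage[metadata["bucket"]][metadata["key"]]
            storage.setdefault(target_bucket, {})[metadata["key"]] = udf(source)
            stored.append((target_bucket, metadata["key"]))
    return stored


storage = {"images": {"cat.png": b"abcdef"}}
registry = [(
    lambda m: m["action"] == "put" and m["key"].endswith(".png"),
    lambda data: data[:3],  # stand-in for a thumbnail UDF
    "thumbnails",
)]
event = {"bucket": "images", "key": "cat.png", "action": "put", "completed": True}
stored = handle_storage_event(event, registry, storage)
```

In this sketch the stand-in UDF merely truncates the object; a real thumbnail UDF would decode and resize the image, and the result would remain available for a later request.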
[0120] Variations of method 600 or methods similar to the method 600 can include a number of different embodiments that may be combined depending on the application of such methods and/or the architecture of devices or systems in which such methods are implemented. Such methods can include registering, in the storage system, the UDF prior to detecting the data storage event. The UDF can be registered in the storage system prior to automatically invoking and running the UDF. Registering the UDF can include registering, for the UDF, data for matching to one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or an SLA.
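One way to picture the registration data described above is the following Python sketch, in which a registration record carries a trigger condition, storage preferences, and an SLA, and a registry matches records against event metadata. The class names, field names, and the exact-match rule are illustrative assumptions rather than a prescribed interface.

```python
from dataclasses import dataclass, field


@dataclass
class UdfRegistration:
    name: str
    # Metadata values that a storage event must carry to trigger the UDF.
    trigger_condition: dict
    storage_preferences: dict = field(default_factory=dict)
    sla: dict = field(default_factory=dict)


class UdfRegistry:
    def __init__(self):
        self._entries = {}

    def register(self, registration):
        """Register a UDF before any matching storage event is detected."""
        self._entries[registration.name] = registration

    def match(self, event_metadata):
        """Return registrations whose trigger condition the event satisfies."""
        return [
            reg for reg in self._entries.values()
            if all(event_metadata.get(k) == v
                   for k, v in reg.trigger_condition.items())
        ]


registry = UdfRegistry()
registry.register(UdfRegistration(
    name="thumbnail",
    trigger_condition={"bucket": "images", "action": "put"},
    storage_preferences={"gpu": "preferred"},
    sla={"max_latency_ms": 500},
))
matches = registry.match({"bucket": "images", "action": "put", "key": "cat.png"})
```

An event on a different bucket or with a different action returns no matches, so no UDF is invoked for it.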
[0121] Variations of method 600 or methods similar to the method 600 can include storing UDFs and parameters for the UDFs in a UDF registry storage in the storage system. The UDF registry storage can be implemented as a specific storage volume in the storage system. In a storage system having multiple storage nodes, each storage node can have a UDF registry storage. The UDF registry storage of each storage node can be arranged to store the same UDF or UDF information or one or more of the storage nodes can be arranged to store specific UDF or UDF information that can be different than that stored on other storage nodes of the storage system. The storage system, on which the UDF can be run, can be a storage network that connects and uses multiple storage systems. The multiple storage systems can be structured in a data center.
[0122] Variations of method 600 or methods similar to the method 600 can include providing security measures, specific to the UDFs, in storing the UDFs in the UDF registry or in retrieving the UDFs from the UDF registry. Variations can include, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more UDFs, or one or more target storage objects.
[0123] Variations can include performing operations with the storage system structured as a single storage system with multiple storage nodes or as multiple storage systems, where each of the multiple storage systems can have one or more
storage nodes. Variations can include scheduling operation of the UDF and selecting one storage node of multiple storage nodes in the storage system on which to run the UDF. To choose a node, a node score can be generated to select the one storage node using one or more of metadata of the UDF, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information. The detecting of the data storage event and the automatic invoking of the UDF can be performed directly within the storage system in one node of multiple storage nodes of the storage system.
[0124] Performing the detecting of the data storage event and the automatic invoking and running of the UDF directly within the storage system can include using a storage-side protocol of automatic invoking and running of the UDF directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
[0125] In various embodiments, a non-transitory machine-readable storage device, such as computer-readable non-transitory medium, can comprise instructions stored thereon, which, when performed by a machine, cause the machine to perform operations, where the operations comprise one or more features similar to or identical to features of methods and techniques described with respect to method 600, variations thereof, and/or features of other methods taught herein such as associated with Figures 1-7. The physical structures of such instructions may be operated on by one or more storage processors. For example, executing these physical structures can cause the machine to perform operations comprising detecting a data storage event in a storage system, the data storage event initiated from exterior to the storage system; extracting metadata from the data storage event detected; and in response to detecting the data storage event in the storage system and after completing the data storage event, automatically invoking a UDF directly within the storage system, based on the metadata extracted, with the UDF residing within the storage system.
[0126] Operations can include storing, in the storage system, a result of operation of the UDF in the storage system upon completing generation of the result without providing the result to a client source or a compute system that
was part of initiation of the data storage event. The result may be provided to a user device via a compute system associated with the storage system upon determination of the result, depending on the parameters of the UDF, or at some later time in response to a request for the result.
[0127] Operations executed by the one or more processors can include registering, in the storage system, the UDF prior to detecting the data storage event. Registering the UDF can include registering, for the UDF, data for matching to one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or an SLA. Operations executed by the one or more processors can include storing UDFs and parameters for the UDFs in a UDF registry storage in the storage system.
[0128] Operations executed by the one or more processors of the storage system can include providing security measures, specific to the UDFs, in storing the UDFs in the UDF registry or in retrieving the UDFs from the UDF registry. Operations can include, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more UDFs, or one or more target storage objects.
[0129] Operations executed by the one or more processors of the storage system can include scheduling an operation of the UDF and selecting one storage node of multiple storage nodes in the storage system on which to run the UDF. Operations can include generating a node score to select the one storage node using one or more of metadata of the UDF, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
[0130] Operations can include performing the detecting of the data storage event and the automatic invoking of the UDF directly within the storage system in one node of multiple storage nodes of the storage system. Operations can include performing the detecting of the data storage event and the automatic invoking of the UDF directly within the storage system including using a storage-side protocol of automatic invoking of the UDF directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
[0131] In various embodiments, a storage system can comprise a memory storing instructions and one or more processors in communication with the memory, where the one or more storage processors execute the instructions. The instructions include instructions to detect a data storage event in the storage system, where the data storage event is initiated from exterior to the storage system, and extract metadata from the data storage event detected. In response to detection of the data storage event in the storage system and after completion of the data storage event, a UDF is automatically invoked directly within the storage system, based on the metadata extracted, with the UDF residing within the storage system. The one or more storage processors can be structured to be operable to store, in the storage system, a result of operation of the UDF in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
[0132] Variations of such a storage system or similar systems can include a number of different embodiments that may or may not be combined depending on the application of such systems and/or the architecture of systems in which methods, as taught herein, are implemented. The storage system can be arranged in an architecture of a compute system separated from the storage system. In general operation, the compute system can generate data and interface with a user device to store and retrieve data from the storage system. The storage system can be located locally with the compute system or remotely from the compute system. Communication between the compute system and the storage system can be implemented over local communication and interface instrumentalities or over a communication network. The storage system can have one or more storage nodes, where each storage node has memory and one or more storage processors. The storage system can also contain data storage equipment in the form of disk drives, where the disk drives are either external to the storage nodes or part of the storage nodes. Each storage node of the storage system can control a certain portion of data storage, and can manage operations on data including operations to store, delete, access, copy, and perform other data operations. These operations can be performed with respect to storage objects on the disk drives associated with the respective storage node.
[0133] Notification of events in a storage node can be conducted via a storage processor for the storage node, where the storage processor or associated memory or circuitry has logic to monitor these events. The notification can be made to a control module in the storage system, where the control module includes one or more processors and stored instructions to perform as an orchestrator for the storage nodes of the storage system. In some embodiments, the notification can be conveyed to the compute system of the architecture.
[0134] Variations of such a storage system or similar systems can include the one or more storage processors structured to be operable to execute stored instructions to register the UDF in the storage system prior to detection of the data storage event. The registration of the UDF can include registration of one or more parameters for the UDF including data for matching to the metadata from the data storage event detected, a trigger condition to respond to the detection of the storage event, one or more user preferences for use of storage resources of the storage system, or an SLA. Such a storage system can include a UDF registry that stores UDFs and parameters for the UDFs.
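A UDF registry holding the registration parameters listed above (matching data, trigger condition, user preferences, and an SLA) might look like the following minimal sketch. The class names, field names, and the exact-match lookup rule are assumptions made for illustration; the source does not define a registry schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class UdfRegistration:
    """One registry entry; the field names are assumptions, not the source's schema."""
    name: str
    func: Callable[[bytes], bytes]
    match: dict                      # data compared against event metadata
    trigger: str                     # e.g. "put" or "delete"
    preferences: dict = field(default_factory=dict)  # user resource preferences
    sla_ms: Optional[int] = None     # hypothetical SLA expressed as a latency bound


class UdfRegistry:
    """In-memory registry that stores UDFs together with their parameters."""

    def __init__(self) -> None:
        self._entries: dict[str, UdfRegistration] = {}

    def register(self, entry: UdfRegistration) -> None:
        if entry.name in self._entries:
            raise ValueError(f"UDF {entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def lookup(self, metadata: dict) -> list[UdfRegistration]:
        """Return entries whose trigger and match data agree with the metadata."""
        return [
            e for e in self._entries.values()
            if e.trigger == metadata.get("operation")
            and all(metadata.get(k) == v for k, v in e.match.items())
        ]


registry = UdfRegistry()
registry.register(UdfRegistration(
    name="thumbnail",
    func=lambda data: data[:16],
    match={"bucket": "images"},
    trigger="put",
    preferences={"max_memory_mb": 256},
    sla_ms=500,
))
hits = registry.lookup({"operation": "put", "bucket": "images", "key": "a.png"})
print([e.name for e in hits])
```

Because registration happens before any event is detected, the lookup at event time only has to compare extracted metadata against stored match data and trigger conditions.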
[0135] Variations of such a storage system or similar systems can include the one or more storage processors structured to be operable to provide security measures, specific to the UDFs, to storage of the UDFs in the UDF registry or to retrieval of the UDFs from the UDF registry. The one or more storage processors can be structured to be operable to execute the instructions to, based on the detection of the storage event in the storage system, process and analyze an event notification, one or more UDFs, or one or more target storage objects.
[0136] Variations of such a storage system or similar systems can include the one or more storage processors structured to be operable to execute the instructions to orchestrate scheduling an operation of the UDF and select one storage node of multiple storage nodes in the storage system on which to run the UDF. The one or more storage processors can be operable to execute the instructions to generate a node score to select the one storage node using one or more of metadata of the UDF, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
[0137] Variations of such a storage system or similar systems can include multiple nodes with each node having a storage processor operable to detect a specific data storage event for the node and to execute a specific UDF in the node in response to detection of the specific data storage event in the node. In various embodiments, the instructions to automatically invoke a UDF directly within the storage system in response to detection of a data storage event in the storage system are portable to different types of storage systems using different storage protocols and capable of operation with one or more protocols that perform storage operations in the storage system.
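One way to read the portability described above is that protocol-specific notifications are normalized into a common event before the UDF invocation logic sees them. The sketch below illustrates that reading only; the notification shapes, field names, and adapter functions are hypothetical and are not drawn from the source or from any real storage protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageEvent:
    """Protocol-neutral event consumed by the UDF invocation logic."""
    operation: str
    object_id: str


def from_object_protocol(raw: dict) -> StorageEvent:
    """Adapt a hypothetical object-store notification."""
    op = {"ObjectCreated": "put", "ObjectRemoved": "delete"}[raw["eventName"]]
    return StorageEvent(op, f'{raw["bucket"]}/{raw["key"]}')


def from_file_protocol(raw: dict) -> StorageEvent:
    """Adapt a hypothetical file-system notification."""
    op = {"close_write": "put", "unlink": "delete"}[raw["action"]]
    return StorageEvent(op, raw["path"].lstrip("/"))


a = from_object_protocol({"eventName": "ObjectCreated", "bucket": "logs", "key": "x.json"})
b = from_file_protocol({"action": "close_write", "path": "/logs/x.json"})
print(a == b)
```

With adapters of this kind, the same invocation logic can run unchanged on storage systems that expose different protocols, since it depends only on the normalized event.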
[0138] Figure 7 is a block diagram illustrating components of an example system 700 that can implement algorithms and perform methods structured to conduct real-time event-driven serverless functions within storage systems, as taught herein. The system 700 can be implemented in a compute system - storage system architecture. The system 700 can include one or more processors 750 that can be structured to execute stored instructions to perform functions of a storage system having a source-side UDF controller for performing real-time event-driven serverless functions within the storage system. The one or more processors 750 can be storage processors. The source-side UDF controller can be implemented as one or more processors and memory with stored instructions for automatically executing UDFs within the storage system as taught herein, with the source-side UDF controller in communication with one or more servers within the storage system. A source-side UDF controller can be implemented as instructions in a number of storage servers of the storage system to automatically execute UDFs within the storage system as taught herein. The source-side UDF controller can be implemented as instructions in each storage server of the storage system. The source-side UDF controller implemented as instructions in the storage system can be realized with plug-in software. The one or more processors 750 can be realized by hardware processors.
[0139] The system 700 may operate as a standalone system or may be connected, for example networked, to other systems. In a networked deployment, the system 700 may operate in the capacity of a server machine, a client machine, or both in server-client network environments. In an example, the system 700 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment. Further, while only a single system is illustrated, the term “system” shall also be taken to include any collection of systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as cloud computing, software as a service (SaaS), or other computing cluster configurations. The example system 700 can be arranged to operate with one or more other devices structured to perform real-time event-driven serverless user-defined functions within a storage system as taught herein.
[0140] Along with the one or more processors 750 (e.g., a CPU, a GPU, a hardware processor core, or any combination thereof), the system 700 can include a main memory 754, and a static memory 756, some or all of which may communicate with each other via a communication link 758. The communication link (e.g., bus) 758 can be implemented as a bus, a local link, a network, other communication path, or combinations thereof. The system 700 may further include a display device 760, an input device 762 (e.g., a keyboard), a user interface (UI) navigation device 764 (e.g., a mouse), and a signal generation device 768 (e.g., a speaker). In an example, the display device 760, input device 762, and UI navigation device 764 can be a touch screen display. The system 700 can include an output controller 769, such as a serial (e.g., USB), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate with or control one or more peripheral devices (e.g., a printer, card reader, etc.). The system 700 can include one or more sensors 766 as IoT clients on the compute-side of the system. The display device 760, the input device 762, the UI navigation device 764, the signal generation device 768, the output controller 769, and the sensors 766 can be structured as part of a compute system in a compute system - storage system architecture for the system 700.
[0141] The system 700 can include a machine-readable medium 752 on which is stored one or more sets of data structures or instructions 755 (e.g., software or data) embodying or utilized by the system 700 to perform any one or more of the techniques or functions for which the system 700 is designed, including controlling, storing, and automatically executing storage-side UDFs with the storage system, where the storage-side UDFs can be registered with respect to one or more storage events. The instructions 755 or other data stored on the machine-readable medium 752 can be accessed by the main memory 754 for use by the one or more processors 750. The instructions 755 may also reside, completely or at least partially, within the main memory 754, within the static memory 756, within a mass storage 751, or within the one or more processors 750.
[0142] While the machine-readable medium 752 is illustrated as a single medium, the term “machine-readable medium” can include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) configured to store the instructions 755 or data. The term “machine-readable medium” can include any medium that is capable of storing, encoding, or carrying instructions for execution by the system 700 and that cause the system 700 to perform any one or more of the techniques for which the system 700 is designed, or that is capable of storing, encoding, or carrying data structures used by or associated with such instructions. Non-limiting machine-readable medium examples can include solid-state memories, optical media, and magnetic media.
[0143] The data from or stored in machine-readable medium 752 or main memory 754 can be transmitted or received over a communications network 759 using a transmission medium via a network interface device 753 utilizing any one of a number of transfer protocols (e.g., frame relay, Internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.). Example communication networks can include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards known as Wi-Fi®, IEEE 802.16 family of standards known as WiMax®), IEEE 802.15.4 family of standards, and peer-to-peer (P2P) networks, among others. In an example, the network interface device 753 can include one or more physical jacks (e.g., Ethernet, coaxial, or phone jacks) or one or more antennas to connect to the communications network. In an example, the network interface device 753 can include a plurality of antennas to wirelessly communicate using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques. The term “transmission medium” shall be taken to include any tangible medium that is capable of carrying instructions to and for execution by the system 700, and includes instrumentalities to propagate digital or analog communications signals to facilitate communication of such instructions, which instructions may be implemented by software. The network interface device 753 can operate in conjunction with the network 759 to communicate between a storage system or components of the storage system and a compute system or components of the compute system in a compute system - storage system architecture. The system 700 can be implemented in a cloud environment.
[0144] The components of the illustrative devices, systems, and methods employed in accordance with the illustrated embodiments can be implemented, at least in part, in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. These components can be implemented, for example, as a computer program product such as a computer program, program code or computer instructions tangibly embodied in a machine-readable storage device, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers.
[0145] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an ASIC, a FPGA (field-programmable gate array) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
[0146] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The elements of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[0147] An architecture and protocol are provided that can allow a user to define UDFs in the storage-side of a compute-side-storage-side architecture. This architecture and protocol can provide a fully automated real-time event-driven data pipeline with efficient data processing, extra security, and lower operational cost. The event-driven and storage-aware serverless function architecture, protocol, and related techniques can be implemented to attain a fully-automated execution plan that is most appropriate for storage-side serverless functions, based on storage resource allocations. Such an architecture and protocol or similar architecture and protocol can reduce network I/O, take full advantage of storage resources, speed up overall processing time, reduce operational cost, and improve data security.
[0148] Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement that is calculated to achieve the same purpose may be substituted for the specific embodiments shown. Various embodiments use permutations and/or combinations of embodiments described herein. It is to be understood that the above description is intended to be illustrative, and not restrictive, and that the phraseology or terminology employed herein is for the purpose of description. Combinations of the above embodiments and other embodiments will be apparent to those of skill in the art upon studying the above description.
Claims
1. A storage system comprising: a memory storing instructions; and one or more storage processors in communication with the memory, wherein the one or more storage processors execute the instructions to: detect a data storage event in the storage system, the data storage event initiated from exterior to the storage system; extract metadata from the data storage event detected; and in response to detection of the data storage event in the storage system and after completion of the data storage event, automatically invoke a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
2. The storage system of claim 1, wherein the one or more storage processors are operable to store, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
3. The storage system of any one of the preceding claims, wherein the one or more storage processors are operable to execute stored instructions to register the user-defined function in the storage system prior to detection of the data storage event.
4. The storage system of claim 3, wherein the registration of the user-defined function includes registration of one or more parameters for the user-defined function including data for matching to the metadata from the data storage event detected, a trigger condition to respond to the detection of the storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
5. The storage system of any one of the preceding claims, wherein the storage system includes a user-defined function registry that stores user-defined functions and parameters for the user-defined functions.
6. The storage system of claim 5, wherein the one or more storage processors are operable to provide security measures, specific to the user-defined functions, to storage of the user-defined functions in the user-defined function registry or to retrieval of the user-defined functions from the user-defined function registry.
7. The storage system of any one of the preceding claims, wherein the one or more storage processors are operable to execute the instructions to process and analyze an event notification, one or more user-defined functions, or one or more target storage objects, based on the detection of the data storage event in the storage system.
8. The storage system of any one of the preceding claims, wherein the one or more storage processors are operable to execute the instructions to orchestrate scheduling an operation of the user-defined function and select one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
9. The storage system of claim 8, wherein the one or more storage processors are operable to execute the instructions to generate a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
10. The storage system of any one of the preceding claims, wherein the storage system includes multiple nodes with each node having a storage processor operable to detect a specific data storage event for the node and execute a specific user-defined function in the node in response to detection of the specific data storage event in the node.
11. The storage system of any one of the preceding claims, wherein the instructions to automatically invoke a user-defined function directly within the storage system in response to detection of a data storage event in the storage system are portable to different types of storage systems using different storage protocols and capable of operation with one or more protocols that perform storage operations in the storage system.
12. A computer-implemented method of storage-side computation, the computer-implemented method comprising: detecting a data storage event in a storage system, the data storage event initiated from exterior to the storage system; extracting metadata from the data storage event detected; and in response to detecting the data storage event in the storage system and after completing the data storage event, automatically invoking a user-defined function directly within the storage system, based on the metadata extracted, with the user-defined function residing within the storage system.
13. The computer-implemented method of claim 12, wherein the computer-implemented method includes storing, in the storage system, a result of operation of the user-defined function in the storage system upon completing generation of the result without providing the result to a client source or a compute system that was part of initiation of the data storage event.
14. The computer-implemented method of claim 12 or claim 13, wherein the computer-implemented method includes registering, in the storage system, the user-defined function prior to detecting the data storage event.
15. The computer-implemented method of claim 14, wherein registering the user-defined function includes registering, for the user-defined function, data for matching to one or more of the metadata from the data storage event detected, a trigger condition to respond to the detection of the data storage event, one or more user preferences for use of storage resources of the storage system, or a service-level agreement.
16. The computer-implemented method of any one of claims 12-15, wherein the computer-implemented method includes storing user-defined functions and parameters for the user-defined functions in a user-defined function registry storage in the storage system.
17. The computer-implemented method of claim 16, wherein the computer-implemented method includes providing security measures, specific to the user-defined functions, in storing the user-defined functions in the user-defined function registry or in retrieving the user-defined functions from the user-defined function registry.
18. The computer-implemented method of any one of claims 12-17, wherein the computer-implemented method includes, based on the detection of the data storage event in the storage system, processing and analyzing an event notification, one or more user-defined functions, or one or more target storage objects.
19. The computer-implemented method of any one of claims 12-18, wherein the computer-implemented method includes scheduling an operation of the user-defined function and selecting one storage node of multiple storage nodes in the storage system on which to run the user-defined function.
20. The computer-implemented method of claim 19, wherein the computer-implemented method includes generating a node score to select the one storage node using one or more of metadata of the user-defined function, node configurations of the multiple storage nodes, runtime event information, or runtime storage system information.
21. The computer-implemented method of any one of claims 12-20, wherein the computer-implemented method includes performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system in one node of multiple storage nodes of the storage system.
22. The computer-implemented method of any one of claims 12-21, wherein performing the detecting of the data storage event and the automatic invoking of the user-defined function directly within the storage system includes using a storage-side protocol of automatic invoking of the user-defined function directly within the storage system integrated with one or more protocols that perform storage operations in the storage system.
23. A non-transitory computer-readable storage medium storing instructions, wherein the instructions, when executed by one or more storage processors, cause the one or more storage processors to perform operations comprising any one of the methods of claims 12-22.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2021/071236 WO2022225578A1 (en) | 2021-08-20 | 2021-08-20 | Real-time event-driven serverless functions within storage systems for near data processing |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022225578A1 (en) | 2022-10-27 |
Family
ID=77802283
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2021/071236 WO2022225578A1 (en) | 2021-08-20 | 2021-08-20 | Real-time event-driven serverless functions within storage systems for near data processing |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022225578A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024184941A1 (en) * | 2023-03-03 | 2024-09-12 | 三菱電機株式会社 | Function transmission device, function execution device, function transmission method, function execution method, function transmission program, and function execution program |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019109023A1 (en) * | 2017-11-30 | 2019-06-06 | Cisco Technology, Inc. | Provisioning using pre-fetched data in serverless computing environments |
US20200327124A1 (en) * | 2019-04-10 | 2020-10-15 | Snowflake Inc. | Internal resource provisioning in database systems |
2021-08-20: WO application PCT/US2021/071236 (WO2022225578A1), status: active, Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21772928; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 21772928; Country of ref document: EP; Kind code of ref document: A1 |