US20230054226A1 - System and method of providing an interactive development platform in a distributed computing environment - Google Patents
- Publication number
- US20230054226A1 (application US17/982,981)
- Authority
- US
- United States
- Prior art keywords
- data
- proxy server
- execution
- documents
- cloud
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/52—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow
- G06F21/54—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow by adding security routines or objects to programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/554—Detecting local intrusion or implementing counter-measures involving event detection and direct action
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/51—Source to source
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/102—Entity profiles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/104—Grouping of entities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/564—Enhancement of application control based on intercepted application data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
Definitions
- The present invention relates to the growing field of data engineering. More precisely, the present invention relates to providing an interactive development platform in distributed computing environments.
- FIG. 1 is a schematic diagram, depicting a workflow of generating a machine learning (ML) model in a production environment.
- a typical process for generating an ML model may include several disciplines.
- One such discipline may be that of a research engineer, or a data scientist, set to analyze or ingest big data.
- Another discipline may be that of a data engineer, who may be set to produce a model (e.g., a NN model), that may be applied on incoming data instances or examples.
- a third discipline may be that of an automation engineer, who may be set to automate training and/or inference of the ML model on large amounts of data, during training, testing and/or operational work modes.
- APIs: Application Programming Interfaces
- the process of transferring between data research and production requires utilization of a wide variety of code languages and repositories.
- the different engineers and researchers typically need to employ a variety of development environments.
- Some of these development environments may be implemented on local computing environments (e.g., installed on computers within an organizational computer network), while others may be implemented using remote computing resources and platforms (e.g., Amazon SageMaker notebooks, and the like).
- a data scientist may acquire large amounts of data, to be iteratively validated and/or prepared or processed by a data engineer.
- an ML model e.g., a NN architecture
- creation and/or training of an ML model typically requires multiple iterations and experimentations.
- a data engineer may need to communicate with an automation engineer in an arduous, iterative process, to automate the execution of the ML model.
- Automation engineers normally need to review the data scientists' code (typically written in an interactive development environment such as Jupyter notebooks), split that code into scripts, and then automate the process using a workflow engine that supports these scripts. This typically results in a high-maintenance, communication-intensive iterative process between data teams and engineering teams.
- embodiments of the invention may enable data scientists, data engineers and automation engineers to safely access production data, and run automation workflows via an interactive development environment.
- The term "safely" is used in this context in the sense that embodiments of the invention may enforce one or more scripting rules and/or one or more run-time rules to mitigate or prevent data breaches and minimize the risk of data exfiltration.
- embodiments of the invention may also determine and/or enforce one or more security rules. Such rules may, for example, dictate what actions need to be done in case a specific data breach has been identified. For example, embodiments may determine that specific users or services would be denied access to data, that specific roles and/or permissions need to be revoked, and the like.
- Embodiments of the invention may include a proxy server for providing an interactive development environment.
- the proxy server may be configured to operatively connect to one or more client modules, each associated with a respective user development platform; receive, from at least one client module of the one or more client modules, one or more interactive computing documents such as Jupyter notebooks.
- Each such interactive computing document may include one or more scripting code elements also referred to herein as cells or code lines.
- the proxy server may be configured to scan the one or more scripting code elements according to a set of predetermined scripting rules; encapsulate the one or more interactive computing documents in one or more data containers, based on said scan; and transmit the one or more data containers to at least one cloud-based platform, to execute at least one scripting code element of the one or more interactive computing documents.
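- The scan-then-encapsulate step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the rule names, the regular expressions, and the JSON "container" format are all assumptions made for the example. The notebook is assumed to be a Jupyter-style dictionary with a `cells` list whose entries carry `cell_type` and `source` fields.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical scripting rules: a code cell is rejected if it matches any pattern.
SCRIPTING_RULES = {
    "no_shell_escape": re.compile(r"^\s*[!%]", re.MULTILINE),
    "no_os_system": re.compile(r"\bos\.system\s*\("),
    "no_raw_sockets": re.compile(r"^\s*import\s+socket\b", re.MULTILINE),
}


@dataclass
class Violation:
    cell_index: int
    rule: str


def scan_notebook(notebook: dict) -> list:
    """Return one Violation per (code cell, rule) pair that matches."""
    violations = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # only scripting code elements are scanned
        source = cell.get("source", "")
        if isinstance(source, list):  # notebooks may store source as a list of lines
            source = "".join(source)
        for name, pattern in SCRIPTING_RULES.items():
            if pattern.search(source):
                violations.append(Violation(index, name))
    return violations


def encapsulate(notebook: dict) -> bytes:
    """Wrap the notebook in a (hypothetical) container payload only if the scan passes."""
    violations = scan_notebook(notebook)
    if violations:
        raise PermissionError(f"scan rejected notebook: {violations}")
    container = {"kind": "notebook-container", "notebook": notebook}
    return json.dumps(container).encode("utf-8")
```

- In this sketch, encapsulation is conditioned on the scan result, mirroring the "based on said scan" wording; a real deployment would presumably build an actual container image rather than a JSON payload.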
- the proxy server may receive, from the at least one client module, a workflow data element, defining a scheme for executing the one or more interactive computing documents.
- the proxy server may encapsulate the workflow data element with the one or more interactive computing documents in the one or more data containers; and transmit the one or more data containers to at least one cloud-based platform, to execute the at least one scripting code element of one or more interactive computing documents according to the execution scheme, as defined by the workflow data element.
- the scheme for executing the one or more interactive computing documents may include at least one of: (a) scheduling of execution of one or more interactive computing documents; (b) an order of execution of one or more interactive computing documents; (c) a condition for executing one or more interactive computing documents; and (d) concurrence of execution of two or more interactive computing documents.
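- One way to picture a workflow data element that carries the four aspects listed above is sketched below. The field names (`schedule`, `after`, `run_if`) and the cron-style schedule string are assumptions for illustration only; the source does not specify a format. Order is expressed as dependencies, and documents that become ready in the same batch are candidates for concurrent execution.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class Step:
    notebook: str                                # identifier of an interactive computing document
    after: list = field(default_factory=list)    # (b) order: documents that must finish first
    run_if: Optional[str] = None                 # (c) condition, e.g. "success" (hypothetical)


@dataclass
class Workflow:
    schedule: str   # (a) scheduling, here assumed to be a cron expression
    steps: list


def execution_batches(workflow: Workflow) -> list:
    """Group documents into ordered batches; each batch may run concurrently (d)."""
    graph = {step.notebook: set(step.after) for step in workflow.steps}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the order is contradictory
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

- For example, a workflow in which two documents both depend on a first one yields three batches, with the middle batch holding the two independent documents.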
- the one or more interactive computing documents may include document metadata.
- the proxy server may be configured to scan the document metadata according to a set of predetermined metadata rules; and encapsulate the workflow data element and the one or more interactive computing documents in one or more data containers, based on said scan of document metadata (e.g., when the document metadata was approved by the scan).
- the proxy server may receive, from the at least one client module, at least one execution parameter value, associated with a specific interactive computing document.
- the proxy server may inject the received execution parameter value as metadata into the relevant interactive computing document; and further encapsulate the document metadata in the one or more data containers.
- the at least one cloud-based platform may, in turn, be configured to execute the at least one scripting code element of the one or more interactive computing documents according to document metadata.
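- Injecting an execution parameter value as document metadata, and reading it back at execution time, might look like the sketch below. The metadata key `execution_parameters` is a hypothetical name chosen for this example; only the top-level `metadata` dictionary is part of the standard notebook format.

```python
import copy


def inject_parameters(notebook: dict, parameters: dict) -> dict:
    """Return a copy of the notebook with parameters stored under its metadata."""
    updated = copy.deepcopy(notebook)  # leave the client's original document untouched
    metadata = updated.setdefault("metadata", {})
    metadata.setdefault("execution_parameters", {}).update(parameters)
    return updated


def resolve_parameters(notebook: dict, defaults: dict) -> dict:
    """On the executing platform: injected values override the defaults."""
    injected = notebook.get("metadata", {}).get("execution_parameters", {})
    return {**defaults, **injected}
```

- Keeping the parameters in metadata rather than rewriting code cells means the scripting code elements that were scanned are the same ones that are executed.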
- the proxy server may be configured to accumulate a subset of scripting code elements, each representing one or more data actions.
- The term "data actions" may refer herein to any action that may be performed on data, including, for example: accessing data (e.g., reading data, writing data, deleting data, etc.) that is stored on a computing device such as a cloud-based platform; importing software modules; running or executing specific commands or sets of commands on the computing device; and the like.
- the proxy server may classify the subset of scripting code elements, in relation to at least one client module, to one or more categories or classifications (e.g., legitimate action, illegitimate action, and the like). For example, the proxy server may classify the subset of scripting code elements in relation to a specific client module that initiated the data action. The proxy server may subsequently determine at least one scripting rule based on said classifications.
- the proxy server may include, or be associated with a feature extraction module, configured to extract one or more data action features from at least one specific scripting code element of the subset of scripting code elements.
- the proxy server may also include, or be associated with an ML based model (e.g., a classification model), configured to classify the specific scripting code element according to the one or more data action classifications, based on the one or more data action features.
- the proxy server may be configured to obtain, from the ML based model, a confidence score, representing pertinence of one or more scripting code elements to the one or more data action classifications; and determine at least one scripting rule based on the confidence score. The proxy server may subsequently enforce the at least one scripting rule on one or more scripting code elements, as elaborated herein.
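- The chain of feature extraction, classification, confidence score, and rule determination can be illustrated with the toy sketch below. A hand-weighted logistic score stands in for the trained ML based model; the feature names, patterns, weights, thresholds, and rule labels are all assumptions for illustration and are not taken from the source.

```python
import math
import re

# Hypothetical data action features, each detected by a simple pattern.
FEATURE_PATTERNS = {
    "reads_data": re.compile(r"\bread_(csv|parquet|sql)\b"),
    "deletes_data": re.compile(r"\b(DROP\s+TABLE|DELETE\s+FROM|rmtree|os\.remove)\b", re.IGNORECASE),
    "network_call": re.compile(r"\b(requests\.(get|post)|urlopen)\b"),
}

# Hypothetical weights; a real system would learn these from labelled data actions.
WEIGHTS = {"reads_data": -1.0, "deletes_data": 2.5, "network_call": 2.0}
BIAS = -1.5


def extract_features(code: str) -> dict:
    """Map a scripting code element to binary data action features."""
    return {name: float(bool(p.search(code))) for name, p in FEATURE_PATTERNS.items()}


def illegitimate_confidence(features: dict) -> float:
    """Confidence (0..1) that the element belongs to the 'illegitimate action' class."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def determine_rule(code: str, block_above: float = 0.8, review_above: float = 0.5) -> str:
    """Turn the confidence score into a scripting rule decision."""
    score = illegitimate_confidence(extract_features(code))
    if score >= block_above:
        return "block"
    if score >= review_above:
        return "flag_for_review"
    return "allow"
```

- Using two thresholds rather than one is a design choice of this sketch: it lets borderline scores produce a softer rule (review) instead of an outright block.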
- the proxy server may be configured to encapsulate the workflow data element and the one or more interactive computing documents by: encapsulating the workflow data element in a first data container; spawning one or more second data containers, each encapsulating one or more interactive computing documents of the one or more interactive computing documents based on the workflow data element; and transmitting the one or more second data containers to the at least one cloud-based platform according to the execution scheme, as defined by the workflow data element.
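- The two-level encapsulation described above, in which a first container holds the workflow and second containers are spawned per document, can be sketched as below. The `Container` type, the payload layout, and the `transmit` callback are hypothetical stand-ins for real container tooling and a real cloud API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Container:
    name: str
    payload: dict


def build_workflow_container(order: list, notebooks: dict) -> Container:
    """First container: holds the execution order and the documents it refers to."""
    return Container("workflow", {"order": list(order), "notebooks": dict(notebooks)})


def spawn_and_transmit(workflow_container: Container,
                       transmit: Callable[[Container], None]) -> list:
    """Spawn one second container per document and send them in workflow order."""
    sent = []
    notebooks = workflow_container.payload["notebooks"]
    for name in workflow_container.payload["order"]:
        child = Container(f"run-{name}", {"notebook": notebooks[name]})
        transmit(child)  # e.g. submit to the cloud-based platform
        sent.append(child.name)
    return sent
```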
- the proxy server may include, or be associated with a monitoring module, adapted to monitor execution of the scripting code elements (of the one or more interactive computing documents) by the at least one cloud-based platform.
- the proxy server may also include, or be associated with a rule engine module, which may be adapted to identify a problem in execution according to a set of predefined runtime rules (e.g., upon triggering of a runtime rule) based on said monitoring, as elaborated herein.
- the rule engine module may be configured to communicate a notification message to a relevant client module (e.g., a client module which initiated the relevant scripting code element).
- the notification message may include a representation of the identified execution problem, to be displayed on a user interface (UI) of the user development platform.
- the rule engine module may be configured to perform at least one mitigation action, according to the identified execution problem (e.g., according to the triggered runtime rule).
- the mitigation action may include, for example, blocking a user of a specific client module from accessing data stored on the at least one cloud-based platform, changing an access permission of a role of a user of a specific client module to a specific cloud-based platform, deleting data footprints of the at least one executed scripting code element, disabling a specific user of a specific client module from executing scripting code elements in their user development platform, and the like.
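- A rule engine that maps triggered runtime rules to notifications and mitigation actions could be sketched as follows. The event fields, the two example rules, their thresholds, and the mitigation labels are hypothetical; the source names the kinds of mitigation but not their representation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExecutionEvent:
    user: str
    cell_index: int
    duration_s: float
    rows_read: int


@dataclass
class RuntimeRule:
    name: str
    triggered: Callable[[ExecutionEvent], bool]
    mitigation: str


# Hypothetical runtime rules and thresholds.
RUNTIME_RULES = [
    RuntimeRule("bulk_read", lambda e: e.rows_read > 1_000_000, "block_user_data_access"),
    RuntimeRule("runaway_cell", lambda e: e.duration_s > 3600, "disable_user_execution"),
]


def evaluate(event: ExecutionEvent, rules=RUNTIME_RULES) -> list:
    """Return one action record per triggered rule, including who to notify."""
    actions = []
    for rule in rules:
        if rule.triggered(event):
            actions.append({
                "rule": rule.name,
                "mitigation": rule.mitigation,
                "notify": event.user,
                "message": f"cell {event.cell_index} triggered runtime rule '{rule.name}'",
            })
    return actions
```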
- the proxy server may monitor execution of the scripting code elements by performing run-time monitoring of logs generated by the at least one cloud-based platform; run-time monitoring of an order of execution of the scripting code elements; run-time monitoring of output of the executed scripting code elements; monitoring duration of execution of at least one scripting code element of the one or more interactive computing documents; performing a comparison between output statistics of a current execution of an interactive computing document and output statistics of a previous execution of the interactive computing document; identifying at least one difference between an output of a specific scripting code element at a first execution and an output of the specific scripting code element at a second execution; identifying at least one difference between a first metadata element, associated with a specific scripting code element at a first execution and a second metadata element, associated with the specific scripting code element at a second execution; and any combination thereof.
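- Three of the monitoring checks listed above (output differences between executions, comparison of output statistics, and execution duration) can be illustrated with the sketch below. The dictionary shapes, the 20% tolerance, and the 3x duration ratio are assumptions chosen for the example.

```python
def diff_executions(previous: dict, current: dict) -> list:
    """Cell IDs whose output differs between a first and a second execution."""
    shared = previous.keys() & current.keys()
    return [cell_id for cell_id in sorted(shared) if previous[cell_id] != current[cell_id]]


def stats_drift(previous: dict, current: dict, tolerance: float = 0.2) -> list:
    """Output statistics whose relative change exceeds the tolerance."""
    drifted = []
    for key in sorted(previous.keys() & current.keys()):
        base = previous[key]
        if base == 0:
            if current[key] != 0:
                drifted.append(key)
        elif abs(current[key] - base) / abs(base) > tolerance:
            drifted.append(key)
    return drifted


def duration_anomaly(previous_s: float, current_s: float, max_ratio: float = 3.0) -> bool:
    """True if the current run took more than max_ratio times the previous run."""
    return previous_s > 0 and current_s / previous_s > max_ratio
```

- Any non-empty result, or a `True` from the duration check, could then be handed to the rule engine as a candidate runtime rule trigger.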
- the monitoring module may be adapted to monitor execution of the scripting code elements of the one or more interactive computing documents by the at least one cloud-based platform; produce a log data element, representing said monitored execution of the scripting code elements; and communicate the log data element to a relevant client module (e.g., a client module which initiated, or sent the relevant interactive code element).
- the proxy server may send the log data element in real time or near real time, to facilitate an aspect of interactivity of execution of the interactive code element.
- the client module may, in turn, be configured to display the log data element on a UI of the user development platform.
- Embodiments of the invention may include a system for providing an interactive development environment.
- Embodiments of the system may include a proxy server module, adapted to interface at least one cloud-based platform; and one or more client modules operatively connected to the proxy server, where each client module may be associated with a respective user development platform.
- the at least one client module of the one or more client modules may be configured to receive, from the respective user development platform, one or more interactive computing documents, each representing one or more scripting code elements; and transmit the one or more interactive computing documents to the proxy server.
- the proxy server may be configured to scan the one or more scripting code elements according to a set of predetermined scripting rules; encapsulate the one or more interactive computing documents in one or more data containers, based on said scan; and transmit the one or more data containers to the at least one cloud-based platform, to execute at least one scripting code element of the one or more interactive computing documents.
- the at least one client module may be configured to produce a workflow data element, defining a scheme for executing the one or more interactive computing documents; and transmit the workflow data element to the proxy server.
- the proxy server may be configured to transmit the one or more data containers and the workflow data element to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the execution scheme, as defined by the workflow data element.
- the proxy server may be configured to encapsulate the workflow data element with the one or more interactive computing documents in the one or more data containers.
- the proxy server may be further configured to, during execution of the one or more interactive computing documents, scan at least one scripting code element according to the set of predetermined scripting rules, to verify at least one of (a) permission-based legitimacy of data retrieval by the at least one scripting code element, and (b) lack of malicious activity in the at least one scripting code element.
- Embodiments of the invention may include a method of providing an interactive development environment.
- Embodiments of the method may include providing a proxy server module, adapted to interface at least one cloud-based platform; providing one or more client modules, operatively connected to the proxy server, where each client module may be associated with a respective user development platform; receiving by at least one client module, from the respective user development platform, one or more interactive computing documents, each representing one or more scripting code elements; scanning the one or more scripting code elements by the proxy server, according to a set of predetermined scripting rules; encapsulating, by the proxy server, the one or more interactive computing documents in one or more data containers, based on said scan; and transmitting, by the proxy server, the one or more data containers to the at least one cloud-based platform, to execute at least one scripting code element of the one or more interactive computing documents.
- Embodiments of the invention may further include producing, by the at least one client module, a workflow data element, defining a scheme for executing the one or more interactive computing documents; encapsulating, by the proxy server, the workflow data element with the one or more interactive computing documents in the one or more data containers; and transmitting, by the proxy server, the one or more data containers to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the execution scheme, as defined by the workflow data element.
- FIG. 1 is a schematic diagram, depicting a workflow of generating a machine learning model in a production environment, as known in the art.
- FIG. 2 is a block diagram, depicting a computing device which may be included in a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention.
- FIG. 3 is a block diagram, depicting application of a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention.
- FIG. 4 is a block diagram, depicting a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention.
- FIG. 5 is a flow diagram, depicting a method of providing an interactive development platform in a distributed computing environment, by at least one processor, according to some embodiments of the invention.
- the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more”.
- the terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like.
- the term “set” when used herein may include one or more items.
- FIG. 2 is a block diagram depicting a computing device which may be included in a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention.
- Computing device 1 may include a processor or controller 2 that may be, for example, a central processing unit (CPU) processor, a chip or any suitable computing or computational device, an operating system 3, a memory 4, executable code 5, a storage system 6, input devices 7 and output devices 8.
- Processor 2 (or one or more controllers or processors, possibly across multiple units or devices) may be configured to carry out methods described herein, and/or to execute or act as the various modules, units, etc. More than one computing device 1 may be included in, and one or more computing devices 1 may act as the components of, a system according to embodiments of the invention.
- Operating system 3 may be or may include any code segment (e.g., one similar to executable code 5 described herein) designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 1, for example, scheduling execution of software programs or tasks or enabling software programs or other modules or units to communicate.
- Operating system 3 may be a commercial operating system. It will be noted that an operating system 3 may be an optional component, e.g., in some embodiments, a system may include a computing device that does not require or include an operating system 3.
- Memory 4 may be or may include, for example, a Random-Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units.
- Memory 4 may be or may include a plurality of possibly different memory units.
- Memory 4 may be a computer or processor non-transitory readable medium, or a computer non-transitory storage medium, e.g., a RAM.
- A non-transitory storage medium such as memory 4, a hard disk drive, another storage device, etc. may store instructions or code which when executed by a processor may cause the processor to carry out methods as described herein.
- Executable code 5 may be any executable code, e.g., an application, a program, a process, task or script. Executable code 5 may be executed by processor or controller 2 possibly under control of operating system 3.
- executable code 5 may be an application that may provide an interactive development environment, as further described herein.
- Although a single item of executable code 5 is shown in FIG. 2 , a system according to some embodiments of the invention may include a plurality of executable code segments similar to executable code 5 that may be loaded into memory 4 and cause processor 2 to carry out methods described herein.
- Storage system 6 may be or may include, for example, a flash memory as known in the art, a memory that is internal to, or embedded in, a micro controller or chip as known in the art, a hard disk drive, a CD-Recordable (CD-R) drive, a Blu-ray disk (BD), a universal serial bus (USB) device or other suitable removable and/or fixed storage unit.
- Data pertaining to an interactive development environment may be stored in storage system 6 and may be loaded from storage system 6 into memory 4 where it may be processed by processor or controller 2 .
- memory 4 may be a non-volatile memory having the storage capacity of storage system 6 . Accordingly, although shown as a separate component, storage system 6 may be embedded or included in memory 4 .
- Input devices 7 may be or may include any suitable input devices, components or systems, e.g., a detachable keyboard or keypad, a mouse and the like.
- Output devices 8 may include one or more (possibly detachable) displays or monitors, speakers and/or any other suitable output devices.
- Any applicable input/output (I/O) devices may be connected to Computing device 1 as shown by blocks 7 and 8 .
- For example, a wired or wireless network interface card (NIC), a universal serial bus (USB) device or external hard drive may be included in input devices 7 and/or output devices 8 . It will be recognized that any suitable number of input devices 7 and output devices 8 may be operatively connected to Computing device 1 as shown by blocks 7 and 8 .
- a system may include components such as, but not limited to, a plurality of central processing units (CPU) or any other suitable multi-purpose or specific processors or controllers (e.g., similar to element 2 ), a plurality of input units, a plurality of output units, a plurality of memory units, and a plurality of storage units.
- a neural network (NN) or an artificial neural network (ANN), e.g., a neural network implementing a machine learning (ML) or artificial intelligence (AI) function may refer to an information processing paradigm that may include nodes, referred to as neurons, organized into layers, with links between the neurons. The links may transfer signals between neurons and may be associated with weights.
- a NN may be configured or trained for a specific task, e.g., pattern recognition or classification. Training a NN for the specific task may involve adjusting these weights based on examples.
- Each neuron of an intermediate or last layer may receive an input signal, e.g., a weighted sum of output signals from other neurons, and may process the input signal using a linear or nonlinear function (e.g., an activation function).
- the results of the input and intermediate layers may be transferred to other neurons and the results of the output layer may be provided as the output of the NN.
- the neurons and links within a NN are represented by mathematical constructs, such as activation functions and matrices of data elements and weights.
- A processor, e.g., a CPU or a graphics processing unit (GPU), or a dedicated hardware device may perform the relevant calculations.
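- As a non-limiting illustration of the neuron computation described above, the following minimal Python sketch computes a weighted sum of input signals and passes it through an activation function. The function names, the sigmoid activation and the bias term are assumptions made for the example and are not taken from the embodiments.

```python
import math


def sigmoid(x):
    """A common nonlinear activation function (chosen here only as an example)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the input signals, followed by the activation function."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


def layer_output(inputs, weight_matrix, biases, activation=sigmoid):
    """One layer: each row of the weight matrix holds the link weights of one neuron."""
    return [
        neuron_output(inputs, row, b, activation)
        for row, b in zip(weight_matrix, biases)
    ]
```

Training, as described above, would amount to adjusting the values in `weight_matrix` based on examples; that step is outside the scope of this sketch.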
- FIG. 3 depicts an example of an application of a system 10 for providing an interactive development platform in a distributed computing environment 50 , according to some embodiments of the invention.
- system 10 may be, or may include a client-server platform, configured to manage access and processing of production data, via an interactive development environment, in an optimized and secure manner.
- system 10 may be implemented as a software module, a hardware module, or any combination thereof.
- system 10 may be or may include a computing device such as element 1 of FIG. 2 , and may be adapted to execute one or more modules of executable code (e.g., element 5 of FIG. 2 ) to manage access and processing of production data, as further described herein.
- Arrows may represent flow of one or more data elements to and from system 10 and/or among modules or elements of system 10 . Some arrows have been omitted in FIG. 3 for the purpose of clarity.
- system 10 may include one or more client modules 200 that may be integrated into, or may collaborate with, a development environment 30 such as Amazon Sagemaker notebook. Additionally, system 10 may include at least one proxy server module 100 that may be integrated into, or may collaborate with, a plurality of cloud computing platforms 510 such as Amazon EMR, Google Cloud and Microsoft Azure, and distributed processing systems 520 such as Apache Spark.
- the at least one proxy server module 100 and the one or more client modules 200 may be implemented on the same computing device (e.g., computing device 1 of FIG. 2 ). Alternatively, the at least one proxy server module 100 and the one or more client modules 200 may be implemented on separate computing devices 1 , and may be communicatively connected via a computer communication network, such as the Internet.
- cloud computing platforms 510 and/or distributed processing systems 520 may facilitate access to a large variety of cloud-based services 530 , including for example cloud storage services and resources 531 such as Amazon S3, cloud analytics databases 532 such as machine-learning (ML) or Business Intelligence (BI) analytics engines (e.g., Amazon Redshift), and data integration services 533 such as AWS Glue.
- system 10 may provide a centralized hub for following, detecting, and/or preventing data access to various cloud platforms and data providers.
- System 10 may identify violations of predefined access rules or privileges, and produce recommendations for implementation of actions upon identification of such conditions, to prevent compromise of production data.
- system 10 may collaborate with, or be integrated into, interactive development environments such as JupyterLab, Zeppelin, and the like.
- The term “interactive” may be used in this context to indicate that development environment 30 may enable a user (e.g., a developer) to interactively or iteratively analyze data and see the results of their analysis (e.g., in the form of datasets, graphic displays, and the like), during, or as part of the development process.
- The term “interactive computing documents” may be used herein to refer to data elements such as Jupyter notebooks, that may be employed by system 10 to facilitate this quality of interactivity during, or following the process of development, as elaborated herein.
- system 10 may include an optimization component 10 A, adapted to implement a data access optimization process.
- the system optimization component may gather historical statistic data from previous executions of scripts in an interactive development environment 30 such as Jupyter Notebook.
- historical statistic data may include, for example metadata representing processing time of scripting code elements (SCEs) such as Notebook cells, an amount (e.g., in data bytes) of data that had been accessed (e.g., read access, write access) by the SCEs, a size of datasets resulting from execution of the SCEs, one or more types or properties of execution solution (e.g., computing devices or platforms used for processing the SCEs), and the like.
- the optimization component may subsequently, and automatically choose a configuration that provides optimal execution of target SCEs in relation to predefined metrics, such as cost metrics, throughput metrics, speed metrics and the like.
- the optimization component may automatically select, and/or configure a cloud computing platform 510 such as Amazon EMR, Google Cloud or Microsoft Azure, to provide optimal execution of one or more target SCEs (e.g., cells) in an interactive development environment 30 such as Jupyter Notebook.
- the optimization component may automatically select, and/or configure a cloud solution such as a cloud-based service 530 (e.g., cloud storage services 531 , cloud analytics databases 532 and/or data integration services 533 ), to provide an optimal execution of the one or more target SCEs.
- the optimization component may automatically select or configure a number, a type, and/or one or more properties of computing devices of distributed processing systems 520 .
- properties may include, for example a size of memory associated with the distributed processing systems 520 , a number and/or a type of processing cores associated with the distributed processing systems 520 , and the like.
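- As a non-limiting illustration of the selection performed by the optimization component, the following Python sketch groups historical execution statistics by configuration and picks the configuration with the best average value of a chosen metric. The record fields, the configuration labels and the use of a simple mean are assumptions made for the example; the embodiments do not prescribe a particular selection algorithm.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ExecutionRecord:
    """One historical execution of an SCE (hypothetical record layout)."""
    config: str       # label of the execution configuration that was used
    seconds: float    # processing time of the SCE
    cost_usd: float   # cost attributed to the execution


def choose_config(history, metric="seconds"):
    """Return the configuration with the lowest mean value of `metric`."""
    by_config = {}
    for record in history:
        by_config.setdefault(record.config, []).append(getattr(record, metric))
    if not by_config:
        raise ValueError("no historical statistics available")
    return min(by_config, key=lambda config: mean(by_config[config]))
```

For instance, the same history may yield one configuration when `metric="seconds"` (a speed metric) and another when `metric="cost_usd"` (a cost metric), which mirrors the choice among predefined metrics described above.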
- proxy server 100 may be operatively connected to one or more client modules 200 , each associated with, or included in a respective user development platform 30 .
- development platform 30 may be, or may include a platform such as Amazon's Sagemaker, where a user or developer may draft or write one or more interactive computing documents (ICDs) 30 A, commonly referred to as notebook documents (or “notebooks”, for short).
- each such notebook document 30 A may include one or more scripting code elements (SCEs) 30 B, commonly referred to as “cells”.
- each client module 200 may be associated with a respective development platform 30 .
- client module 200 may be implemented as a software module or an add-on that may interface, or communicate with a respective development platform 30 (e.g., the Sagemaker platform).
- Client module 200 may also operatively connect to, or communicate with, one or more proxy servers 100 , to provide capabilities of system 10 as elaborated herein.
- development platform 30 and client module 200 may be installed, or included on the same computing device, such as computing device 1 of FIG. 2 .
- development platform 30 and client module 200 may be implemented on separate computing devices 1 , and may be communicatively connected via a computer network such as the Internet.
- client 200 may receive, from development platform 30 one or more ICDs (e.g., notebooks) 30 A, that may include one or more SCEs (e.g., cells) 30 B. Client 200 may transmit the one or more received ICDs 30 A to proxy server 100 , for example as a data action, such as a data access request 200 B.
- Proxy server 100 may, in turn, receive the one or more data actions 200 B (e.g., data access requests 200 B that include ICDs 30 A) from at least one client module 200 , and may communicate with distributed computing environment 50 to execute, or process, the one or more data actions 200 B on cloud computing platforms 510 of the distributed computing environment 50 .
- proxy server 100 and client 200 may jointly implement an optimization aspect or component 10 A of system 10 , to improve aspects of efficiency in execution of one or more cells 30 B.
- proxy server 100 may implement a caching system, to improve throughput or latency of execution.
- proxy server 100 may be configured to maintain (e.g., in storage 6 of FIG. 2 ) historical results of executed cell 30 B processes.
- Proxy server 100 may receive from one or more clients 200 one or more new cells 30 B, to be executed or processed in distributed computing environment 50 .
- Proxy server 100 may automatically select whether to return previously retrieved results from cache (e.g., from storage 6 ), or communicate the newly received cells to distributed computing environment 50 , in order to re-run the processing.
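- As a non-limiting illustration of the caching behavior described above, the following Python sketch returns a stored result when the same cell source has already been executed, and otherwise forwards the cell for execution and stores the outcome. Keying the cache on a hash of the cell source, and the `execute_remotely` callable that stands in for the distributed computing environment, are assumptions made for the example.

```python
import hashlib


class CellResultCache:
    """Keeps historical results of executed cells, keyed by their source code."""

    def __init__(self, execute_remotely):
        # `execute_remotely` is a hypothetical callable: cell source -> result.
        self._execute_remotely = execute_remotely
        self._results = {}

    @staticmethod
    def _key(source):
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def run(self, source, force_rerun=False):
        """Return (result, served_from_cache) for the given cell source."""
        key = self._key(source)
        if not force_rerun and key in self._results:
            return self._results[key], True
        result = self._execute_remotely(source)
        self._results[key] = result
        return result, False
```

This sketch ignores the possibility that the underlying data has changed since the cached run; a practical selection between cached and fresh results would also need some notion of staleness, exposed here only through the `force_rerun` flag.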
- client 200 may be configured to scan a source code in one or more notebooks 30 A or cells 30 B received from interactive development environment 30 according to a predetermined set of rules, to find failures in the source code.
- Client 200 may transmit a notification message 200 A to server 100 , that may include information regarding outcome of the scan.
- Such information may include, for example, whether any failures in the source code were found and/or whether the server is required to perform further analysis to decide whether any optimization actions are required.
- Failure of execution of source code in a cell 30 B may be caused by a processing timeout or a lack of memory in a cloud computing platform 510 .
- Proxy server 100 may determine that an upgrade (e.g., addition of memory space and/or additional allocation of processors) of cloud computing platforms 510 is required.
- client 200 may be configured to scan a source code of cells 30 B to identify indications in the source code that require usage of specific, or special computing devices. For example, client 200 may identify code snippets that make use of “GPU” or “CUDA” indications, or require usage of machine-learning libraries or APIs such as TensorFlow or Keras. Client 200 may transmit an indication 200 A of such findings to proxy server 100 for further analysis, to decide whether a GPU or TPU computing device should be used for processing or retrieving source code of the scanned cell 30 B.
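- As a non-limiting illustration of such a scan, the following Python sketch searches the source code of a cell for terms that hint at a need for special computing devices. The term list is taken from the examples above; the function name and the case-insensitive substring matching are assumptions made for the example.

```python
import re

# Terms mentioned above as indications of GPU or ML-library usage.
ACCELERATOR_TERMS = ("gpu", "cuda", "tensorflow", "keras")
_PATTERN = re.compile(
    "|".join(re.escape(term) for term in ACCELERATOR_TERMS), re.IGNORECASE
)


def find_accelerator_hints(cell_source):
    """Return the sorted, lower-cased set of hint terms found in the cell."""
    return sorted({match.group(0).lower() for match in _PATTERN.finditer(cell_source)})
```

Plain substring matching may produce false positives (e.g., a term appearing inside a comment or an unrelated identifier), which is consistent with treating the findings as an indication 200 A for further analysis by proxy server 100 rather than as a final decision.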
- proxy server 100 and client 200 may jointly implement a security aspect or component 10 B of system 10 , to improve aspects of data security prior to, and during execution of one or more cells 30 B.
- proxy server 100 and/or client 200 may be configured to perform a preemptive scan of one or more SCEs of the interactive development environment 30 (e.g., source code included in notebooks 30 A and/or cells 30 B) according to one or more (e.g., a set of) predetermined scripting rules 100 D.
- The term “preemptive” may be used herein to indicate a scan that may be performed prior to communicating the cells' 30 B source code to distributed computing environment 50 for processing.
- proxy server 100 may be configured to scan source code of one or more cells 30 B based on the predetermined scripting rules 100 D, to find utilization of elements such as commands, actions, API calls and import objects (e.g., libraries such as Pandas, AWS Data Wrangler or direct Spark commands) that may be used to retrieve data from distributed computing environment 50 in a manner that circumvents, or is not performed via system 10 .
- Proxy server 100 may log such occurrences, and/or implement preemptive measures or actions to mitigate or avoid risk of data exfiltration.
- proxy server 100 may be configured to block access of a specific entity (e.g., a specific client 200 , a specific user, a specific account or role, a specific cell or notebook, and the like) to relevant elements (e.g., cloud services 530 , cloud computing platforms or clusters 510 , datasets and the like) of distributed computing environment 50 .
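- As a non-limiting illustration of a preemptive scan, the following Python sketch parses the source code of a cell with the standard `ast` module and reports import statements that match a set of blocked modules. The contents of `BLOCKED_MODULES` stand in for scripting rules 100 D and are assumptions made for the example, as is the function name.

```python
import ast

# Hypothetical stand-in for scripting rules 100 D: top-level module names
# whose import may indicate data retrieval that bypasses the proxy server.
BLOCKED_MODULES = {"pandas", "awswrangler", "pyspark"}


def find_blocked_imports(cell_source, blocked=BLOCKED_MODULES):
    """Return sorted (line number, module name) pairs for blocked imports."""
    findings = []
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0] in blocked:
                findings.append((node.lineno, name))
    return sorted(findings)
```

The findings may then be logged, or used to decide on a preemptive measure such as blocking. The sketch assumes that the cell contains plain Python; cells with notebook-specific syntax (e.g., magic commands) would need to be preprocessed before parsing.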
- proxy server 100 and/or client 200 may be configured to perform a runtime scan of one or more SCEs of the interactive development environment 30 .
- The term “runtime” may be used herein to indicate a scan that may be performed in conjunction with running, or execution of source code in one or more cells 30 B or notebooks 30 A.
- proxy server 100 may track manipulation and/or diversion of data (or a data-set, as commonly referred to in the art) retrieved from distributed computing environment 50 .
- proxy server 100 may hold a copy 100 A of one or more relevant, retrieved elements of the data-set. Additionally, or alternatively, proxy server 100 may maintain a runtime log 100 B of actions that have been performed on data elements in the data-set. Runtime log 100 B may include, for example, code lines of executed data manipulations, timestamps of executed data manipulations, results or outputs of executed data manipulations, relevant object names, relevant file names, relevant database tables and the like.
- proxy server 100 may be configured to continuously (e.g., repeatedly, over time) perform runtime analysis of the copy 100 A of one or more elements of the data-set (referred to herein as “copy 100 A”) and/or runtime log 100 B.
- proxy server 100 may be configured to find relevant data elements and corresponding relevant commands, that may be regarded as suspicious, according to a predetermined set of runtime rules 100 C.
- a runtime rule 100 C may be implemented as a data structure (e.g., a table), that may associate at least one element in distributed computing environment 50 (e.g., a portion of a dataset) with at least one rule or restriction to be imposed upon that element.
- a runtime rule 100 C may dictate that at least one data element of a dataset stored by a cloud storage service 531 of distributed computing environment 50 may be prohibited from being stored elsewhere (e.g., locally), on a storage device (e.g., element 6 of FIG. 2 ) associated with a computing device 1 where development environment 30 is deployed. It may be appreciated that additional runtime rules 100 C may also be generated, or used, depending on specific characterization of proxy server 100 .
- Proxy server 100 may be configured to identify an execution problem as a condition in which runtime rule 100 C has been triggered. Pertaining to the example above, an identified execution problem may be a condition in which an SCE 30 B has attempted to store the relevant data element in a prohibited location.
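- As a non-limiting illustration of runtime rules 100 C and runtime log 100 B, the following Python sketch represents each rule as an association between an element and a prohibited action, and reports logged actions that trigger a rule. The class names, field names and action labels are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeRule:
    """Associates an element (e.g., a dataset) with a prohibited action."""
    element: str
    prohibited_action: str


@dataclass(frozen=True)
class LoggedAction:
    """One entry of the runtime log (hypothetical layout)."""
    element: str
    action: str
    code_line: str


def find_execution_problems(rules, runtime_log):
    """Return the logged actions that trigger at least one runtime rule."""
    prohibited = {(rule.element, rule.prohibited_action) for rule in rules}
    return [entry for entry in runtime_log if (entry.element, entry.action) in prohibited]
```

Pertaining to the example above, a rule such as `RuntimeRule("customers_table", "store_locally")` would be triggered by a log entry recording a local save of that dataset, and the resulting list may serve as the basis for a mitigation action.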
- proxy server 100 and/or client 200 may be configured to enforce the runtime rules, to mitigate risks to data security, based on the runtime analysis (e.g., based on identification of an execution problem, or trigger of a runtime rule 100 C).
- proxy server 100 may be configured to communicate a mitigation action 100 G as an alert message 100 F (e.g., a revocation alert message 100 F) to client 200 , which may be configured to perform a mitigation action 100 G (e.g., revocation action) in the development environment, based on message 100 F, so as to delete the relevant data copies and protect against data exfiltration.
- system 10 may serve as a virtualization layer, enabling users of development environment 30 to configure, and interface with, distributed computing environments 50 of various types and configurations.
- system 10 may include a development environment integrated library that may facilitate a unified API 100 E to access a multitude of types of cloud-based services and/or data elements via proxy server 100 .
- Unified API 100 E may be adapted to enable streamlined user definition of computing platforms and clusters, management of cloud-based datasets, and execution of big-data code scripts in a manner that is oblivious to the specific underlying cloud-based environment technology.
- system 10 may utilize unified API 100 E to enable management of user data access and roles, so as to allow access to specific data, and/or run specific data manipulation scripts, while monitoring one or more (e.g., every) incident of data access. Based on this monitoring, system 10 may determine maliciousness of incidents of data access, and optionally block access in case malicious or unauthorized access is detected, as elaborated herein. Additionally, system 10 may be configured to revoke access and/or automatically delete any data leftovers, such as development environment variables that may be stored or cached in one or more computing devices 1 associated with development environment 30 .
- system 10 may be configured to provide real-time alerts to any predetermined endpoint, such as a predefined email, a text message (e.g., a short messaging service (SMS) message) associated with a specific person, a chat-based service, associated with a specific account, and the like.
- system 10 may identify an unauthorized access to a cloud-based element (e.g., a data element stored by a cloud storage service 531 of distributed computing environment 50 ). Such identification may be based, for example, on (a) automatic detection of an anomalous incident of data access; (b) a rule-based monitoring of data access; (c) revocation (manual or automatic) of a policy, a role, and/or a dataset; (d) expiration of a timeframe for accessing the relevant, specific dataset, and the like.
- server 100 may generate an alert notification 100 F, and may send notification 100 F e.g., as an email message to one or more relevant accounts.
- notification 100 F may include a command, such as a command to perform a mitigation action 100 G (e.g., delete a specific data element).
- Notification 100 F may be sent to one or more (e.g., all) clients 200 that are associated with relevant development environments 30 .
- These relevant development environments 30 may be development environments 30 that hold the relevant data, or that are associated with other memory devices, data storage devices, caches, files etc., that were used to save the relevant data.
- Clients 200 may subsequently proceed to actively delete or purge the relevant data elements as dictated by notification 100 F.
- FIG. 4 is a block diagram, depicting another example of a system 10 for providing an interactive development platform in a distributed computing environment 50 , according to some embodiments of the invention. It may be appreciated that system 10 of FIG. 4 may be the same as system 10 of FIG. 3 . Some elements of FIG. 3 have been omitted in FIG. 4 for the purpose of clarity.
- At least one client 200 may include a code run module 210 , adapted to receive, from a respective development environment 30 one or more ICDs (e.g., notebooks) 30 A, each including one or more SCEs (e.g., cells) 30 B.
- Code run module 210 may be configured to process the received SCEs 30 B to extract one or more entities, intended for obtaining a specific data set. Such entities may include, for example, commands, SQL syntax queries, source code elements (e.g., Spark or map-reduce code, notebook .ipynb files, etc.) and the like. Code run module 210 may be configured to utilize unified API 100 E to convert the one or more entities to a relevant format, and send the converted ICDs (e.g., notebooks) 30 A and/or SCEs (e.g., cells) 30 B to proxy server 100 .
- Proxy server 100 may in turn communicate the converted notebooks 30 A and/or cells 30 B to distributed computing environment 50 , so as to perform data actions (e.g., data access actions, such as write data into, or delete data from a specific, allowed location on a specific cloud-based service 530 ).
- code run module 210 may be configured to produce at least one ICD (e.g., a notebook) 30 A that may be referred to herein as a workflow data element 210 A (or “workflow” 210 A for short).
- Workflow data element 210 A may be, or may include a data structure (e.g., a table) that may associate between execution properties of one or more notebooks 30 A and/or cells 30 B.
- workflow 210 A may define a scheme for executing the one or more ICDs 30 A and/or SCEs 30 B.
- workflow data element 210 A may define a scheme for executing one or more ICDs 30 A in a sense of scheduling of execution of one or more notebooks 30 A or cells 30 B.
- scheduling may include, for example setting a trigger (e.g., a timestamp, an interrupt, etc.) for executing the one or more notebooks 30 A or cells 30 B by a cloud computing platform 510 of distributed computing environment 50 .
- workflow data element 210 A may define a scheme for executing one or more ICDs 30 A in a sense of setting an order of execution of one or more notebooks 30 A or cells 30 B.
- ordering may include, for example setting a first notebook 30 A to be executed by a cloud computing platform 510 , and setting a second notebook 30 A to be executed by cloud computing platform 510 only after execution of the first notebook 30 A is completed.
- workflow data element 210 A may define a scheme for executing one or more ICDs 30 A in a sense of setting a logical condition for executing the one or more notebooks 30 A or cells 30 B (e.g., execute a first notebook 30 A only if a first condition is met, and execute a second notebook 30 A only if a second condition is met).
- workflow data element 210 A may define a scheme for executing one or more ICDs 30 A in a sense of setting up a condition of concurrence of execution of two or more notebooks 30 A or cells 30 B. For example, setting up a plurality of cells to be executed in parallel (e.g., substantially at the same time), e.g., by using a plurality of processors of cloud computing platform 510 .
- workflow data element 210 A may be received, e.g., from a user or another computing device (e.g., via input 7 of FIG. 2 ). Additionally, or alternatively, code run module 210 may receive (e.g., as a script, via a user interface, and the like) one or more parameters defining characteristics or properties of a workflow data element 210 A (e.g., ordering of notebook execution, concurrence of execution, condition of concurrence of execution, timing of notebook execution, and the like), and may produce workflow data element 210 A so as to define a scheme for executing one or more ICDs 30 A, based on the one or more received parameters.
- code run module 210 may be configured to transmit workflow data element 210 A to proxy server 100 , which may in turn communicate workflow 210 A to distributed computing environment 50 , in order to execute the one or more notebooks 30 A or cells 30 B according to the execution scheme defined by workflow 210 A.
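- As a non-limiting illustration of a workflow that sets an order and a concurrence of execution, the following Python sketch derives batches of notebooks from declared dependencies, using the standard `graphlib` module (Python 3.9 and later). Notebooks in the same batch have no mutual dependencies and could be executed in parallel. The dictionary layout of the workflow and the notebook names are assumptions made for the example; scheduling triggers and logical conditions are not modeled.

```python
from graphlib import TopologicalSorter


def execution_batches(workflow):
    """Group notebooks into ordered batches that respect their dependencies.

    `workflow` maps a notebook name to a dict whose optional "after" entry
    lists the notebooks that must complete before it may start.
    """
    sorter = TopologicalSorter(
        {name: spec.get("after", ()) for name, spec in workflow.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For a workflow in which `clean` and `stats` both follow `load`, and `report` follows both, the function yields three batches: `load` alone, then `clean` and `stats` together, then `report`.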
- At least one client 200 may include a dataflow monitoring module 220 , configured to extract relevant entities of the received notebooks 30 A or cells 30 B.
- Such extracted entities may include, for example variables, file-names, local-database-actions, names, cache-engines and/or cookies used in the development environment that include, or are a copy or an output of a dataset retrieved from distributed computing environment 50 .
- the extracted entities may include parameters that are included in cells 30 B, that may be utilized to determine legitimacy of data actions (e.g., data access).
- Dataflow monitoring module 220 may be configured to log every code snippet that mentions the dataset name, or an entity identified as a copy of the dataset, and send the logs to proxy server 100 , while keeping a local encrypted log that may be used by a data revocation module 230 , as elaborated herein.
- At least one client 200 may include a data revocation module 230 , adapted to receive an alert or a message (e.g., element 100 F of FIG. 3 ) of revocation from proxy server 100 , and get the relevant objects (e.g., variables, file stores, caches) from dataflow monitoring module 220 .
- Data revocation module 230 may subsequently delete the relevant data according to message 100 F.
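- As a non-limiting illustration of a revocation action, the following Python sketch removes named variables from a namespace and deletes listed files, and returns a report of what was actually removed. Representing the development environment's variables as a dictionary, and the function and parameter names, are assumptions made for the example.

```python
import os


def purge_data(namespace, variables, files):
    """Delete the given variables and files; report what was removed.

    `namespace` is a dict standing in for the variables of the development
    environment; `variables` and `files` list the objects to be purged.
    """
    report = {"variables": [], "files": []}
    for name in variables:
        if name in namespace:
            del namespace[name]
            report["variables"].append(name)
    for path in files:
        if os.path.isfile(path):
            os.remove(path)
            report["files"].append(path)
    return report
```

The returned report lists only the objects that existed and were removed, which makes it suitable as the content of a notification to relevant users regarding the affected entities.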
- At least one client 200 may include a notification module 240 .
- data revocation module 230 may be configured to communicate with notification module 240 , to notify the results of a revocation to relevant users.
- Notification module 240 may, in turn, be configured to transmit a notification 240 A to at least one user regarding occurrence of data revocation, and relevant entities (e.g., variables, files, caches) that have been affected.
- the notification may be transmitted, for example as a text message to an email account or messaging account of one or more relevant users.
- notification module 240 may transmit notification 240 A to development environment 30 , to be presented to a user.
- At least one client 200 may include a code tracking module 250 , configured to scan source code that is included in one or more ICDs 30 A (e.g., notebooks 30 A) or SCEs 30 B (e.g., cells 30 B) of development environment 30 .
- Code tracking module 250 may scan the source code continuously (e.g., periodically, or repeatedly over time). Alternatively, code tracking module 250 may scan the source code by demand, or after every instance of code execution.
- code tracking module 250 may look for specific code executions results, failures or snippets that contain or match a list of terms. This list of terms may, for example be retrieved from proxy server 100 or from a database 80 .
- code tracking module 250 may log each finding according to a relevant log category.
- the log categories may include, for example specific instances of source-code execution, specific data-sets, specific data risks, and the like.
- code tracking module 250 may scan the development environment's 30 source code (e.g., source code in ICDs 30 A and/or SCEs 30 B) to find instances of code that are adapted to attempt to access distributed computing environment 50 while circumventing proxy server 100 . Such access attempts may utilize, for example, commands, action APIs and import-objects to retrieve data stored by cloud-based services 530 . Code tracking module 250 may log findings of such instances, and may transmit these findings to proxy server 100 in order to log, notify and/or block relevant access, or to optimize a current or a subsequent execution after “learning” the relevant terms and code-snippets.
- code tracking module 250 may collaborate with data revocation module 230 to delete or purge data footprint of at least one SCE 30 B.
- code tracking module 250 may record a lineage of at least one data element or data frame of SCE 30 B.
- Such lineage may include, or represent a history of locations and/or links to locations in which at least one version of the at least one data element of SCE 30 B was stored or kept.
- Code tracking module 250 may record copying of a first data frame into a second data frame, renaming (e.g., changing a filename) or reformatting (e.g., changing a file type) of a first data frame into one or more second data frames, inclusion of a first data frame (e.g., a data structure such as a table) in one or more second data frames (e.g., in a comma-separated values (CSV) file), uploading or saving the first data frame (e.g., onto a cloud storage service), and the like.
- Code tracking module 250 may subsequently be invoked by data revocation module 230 as part of a mitigation action, to delete or purge a footprint of the first data frame or data element. Pertaining to the same example, code tracking module 250 may delete the footprint of the first data frame or data element by deleting, or disabling access to copies and/or versions of the first data frame and/or the one or more second data frames.
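- As a minimal sketch of the lineage recording and footprint purging described above, a lineage may be kept as a mapping from each data element to the elements derived from it. The class name `LineageTracker` and its methods are hypothetical and are not taken from the described system.

```python
from collections import defaultdict

class LineageTracker:
    """Records which data elements were derived from which (illustrative only)."""

    def __init__(self):
        self._derived = defaultdict(set)  # source name -> names derived from it

    def record(self, source, target):
        """Record that `target` was copied, renamed, or exported from `source`."""
        self._derived[source].add(target)

    def footprint(self, name):
        """Return `name` and everything transitively derived from it."""
        seen, stack = set(), [name]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._derived.get(current, ()))
        return seen

tracker = LineageTracker()
tracker.record("df", "df_copy")              # copy of a data frame
tracker.record("df_copy", "export.csv")      # inclusion in a CSV file
tracker.record("df", "s3://bucket/df.parquet")  # upload to cloud storage
to_purge = tracker.footprint("df")
```

- A data revocation step may then iterate over `to_purge` and delete, or disable access to, each recorded copy or version.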
- code tracking module 250 may be configured to perform run-time analysis of execution of one or more SCEs (e.g., cells) 30 B of one or more ICDs 30 A (e.g., notebooks).
- code tracking module 250 may continuously communicate with proxy server(s) 100 to monitor output logs produced by computing environment 50 , and extract therefrom processing statistics such as processing time (e.g., duration of execution) of SCEs (e.g., cells) 30 B.
- Code tracking module 250 may then analyze the processing statistics, for example by comparing the processing statistics (e.g., execution time) to historical processing statistics of previous executions, as elaborated herein.
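- One possible way to compare a current processing time to historical processing statistics is a standard-deviation test, sketched below. The function name, the choice of a z-score test, and the default threshold of three standard deviations are assumptions made for illustration; the description does not prescribe a specific comparison.

```python
from statistics import mean, stdev

def is_abnormal_duration(current, history, max_sigma=3.0):
    """Flag `current` if it deviates from `history` by more than max_sigma deviations."""
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > max_sigma

history = [1.0, 1.1, 0.9, 1.0]  # durations of previous executions, in seconds
slow = is_abnormal_duration(5.0, history)
normal = is_abnormal_duration(1.05, history)
```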
- At least one proxy server 100 may include a data request analysis module 110 , adapted to analyze commands and features received from a client's 200 code run module 210 . Additionally, data request analysis module 110 may be configured to scan files that may be sent by one or more clients 200 as part of command parameters.
- Data request analysis module 110 may extract one or more features, such as dataset names and terms, from the received command, and compare the one or more extracted features to historical similar commands, results, and/or processing periods, and/or to features associated with specific users, such as roles, policies, countries, Internet Protocol (IP) addresses, and the like. Based on this comparison, data request analysis module 110 may check for abnormal behavior.
- Data request analysis module 110 may receive (e.g., from input 7 of FIG. 2 ) a set of predetermined scripting rules 100 D and/or runtime rules 100 C. Additionally, or alternatively, and as elaborated herein, data request analysis module 110 may automatically generate a set of scripting rules 100 D and/or runtime rules 100 C using a machine-learning (ML) model 113 , based on data from a data enhancement module 150 and historical logs.
- client 200 may upload ICD 30 A to proxy server 100 to be scanned.
- Proxy server 100 may analyze the received ICD 30 A code elements 30 B in view of a repository or database 80 of suspicious or malicious actions (denoted in FIG. 4 as database 80 ).
- repository 80 may include a static repository 80 A, that may include or store one or more definitions of actions that are identified as suspicious or malicious (e.g., from third-party databases or online data sources (e.g., GitHub)).
- database 80 may include a dynamic repository 80 B, that may store, or include definitions of actions that are classified as suspicious or malicious by ML model 113 , in a continuous effort to identify such actions or code lines in SCEs 30 B.
- ML model 113 may be configured to learn a pattern of history of actions of specific users and/or groups of users, as represented by code lines or SCEs 30 B (e.g., cells) in notebooks 30 A, to identify anomalous actions, and mitigate potential threats derived from such actions.
- ML model 113 may be configured to accumulate a subset of SCEs 30 B, each representing one or more data actions.
- the one or more data actions may include, for example actions of accessing data that is stored on the at least one cloud-based platform 50 , actions of importing software modules, actions of running or executing specific commands or sets of commands, and the like.
- ML model 113 may, in turn, classify the subset of scripting code elements (e.g., in relation to at least one client module 200 ), to one or more data action classifications or categories.
- ML model 113 may be configured to classify the specific type of SCE 30 B as a legitimate action for that user and/or client 200 (e.g., the corresponding action classification may be “legitimate data access”).
- ML model 113 may be configured to classify the specific type of SCE 30 B as an illegitimate action for that user and/or client 200 (e.g., the corresponding action classification may be “illegitimate data access”).
- ML model 113 may determine at least one scripting rule 100 D or metadata rule 100 D′, that may associate the relevant SCE 30 B or action with a specific mitigation action or message 100 F.
- data request analysis module 110 may include a feature extraction module 112 , configured to extract one or more action features 112 A from at least one specific SCE 30 B.
- the one or more features 112 A may include, for example features of data access actions, representing patterns of accessing data on a distributed computing platform 50 .
- Feature extraction module 112 may introduce the one or more extracted features 112 A to ML model 113 , which may classify the specific SCE 30 B according to one or more action classifications (e.g., data access action classifications), based on the one or more features 112 A (e.g., the data action features 112 A).
- data action features 112 A may include data access features such as a frequency of accessing data by one or more specific users, historical success, or failure of accessing data by the specific users, latency of data retrieval and the like.
- ML model 113 may receive data action features 112 A (e.g., data access features 112 A) pertaining to a plurality of SCEs 30 B, and may be trained to predict or classify corresponding SCEs 30 B as including legitimate actions (e.g., legitimate data access requests) or illegitimate actions (e.g., illegitimate data access requests).
- ML model 113 may receive action features 112 A (e.g., data access features 112 A) pertaining to a specific target SCE 30 B, and classify the target SCE 30 B as including legitimate or illegitimate actions (e.g., legitimate, or illegitimate data access requests), according to the one or more action features 112 A (e.g., data access features 112 A) based on the training.
- proxy server 100 may be configured to verify, during run-time or just preceding run-time, at least one of (a) permission-based legitimacy of data retrieval by the at least one scanned SCE 30 B, and (b) lack of malicious activity in the at least one scanned SCE 30 B.
- ML model 113 may produce a confidence score, representing pertinence of one or more SCEs 30 B to the one or more classifications.
- ML model 113 may provide a confidence score representing the extent to which a specific SCE 30 B pertains to a specific action classification (e.g., legitimate, or illegitimate data access requests).
- Data request analysis module 110 may subsequently determine at least one scripting rule 100 D based on the confidence score.
- scripting rule 100 D may associate the SCE 30 B with a first (strong) mitigation action 100 G, such as denial of execution and/or revoking of data.
- scripting rule 100 D may associate the SCE 30 B with a second (weak) mitigation action 100 G, such as sending a notification 100 F to an administrator's email.
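- The association between a confidence score and a strong or weak mitigation action may, for example, be sketched as a pair of thresholds. The threshold values, the action names, and the assumption that a higher confidence maps to the stronger action are illustrative only and are not specified herein.

```python
def select_mitigation(label, confidence, strong_threshold=0.9, weak_threshold=0.6):
    """Map a classification and its confidence score to a mitigation action name."""
    if label != "illegitimate data access" or confidence < weak_threshold:
        return "allow"
    if confidence >= strong_threshold:
        return "deny_execution"        # strong mitigation action
    return "notify_administrator"      # weak mitigation action

strong = select_mitigation("illegitimate data access", 0.95)
weak = select_mitigation("illegitimate data access", 0.70)
none = select_mitigation("legitimate data access", 0.99)
```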
- ML model 113 may cluster SCEs 30 B (or command lines, included in SCEs 30 B) based on previously received SCEs 30 B, and/or SCEs 30 B that are already included in database or repository 80 .
- ML model 113 may cluster SCEs 30 B into one or more clusters (denoted 113 ′) to form a clustering model (also denoted 113 ′), and then classify one or more incoming target SCE 30 B as legitimate, or illegitimate based on the clustering model 113 ′.
- The terms “clusters” and “cluster model” may be used interchangeably in this context.
- data request analysis module 110 may calculate a similarity score, or similarity metric value 117 for one or more code lines of the SCEs 30 B to corresponding code lines in repository 80 .
- A first code line of a first SCE 30 B, such as “%run -t mynb.ipynb”, may have high similarity (e.g., a high similarity metric value 117 ) to a second code line, such as “%run”, that may be stored in repository 80 as a legitimate code line.
- Data request analysis module 110 may apply a similarity threshold to associate these two code lines in the same cluster 113 ′ of the clustering model 113 ′, and may subsequently classify the code line “%run -t mynb.ipynb” as legitimate, based on this clustering 113 ′.
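- The similarity-threshold step above may be sketched as follows. The similarity metric is not specified herein; the token-containment measure used below, the threshold value, and the repository contents are assumptions chosen so that the “%run” example behaves as described.

```python
def similarity(code_line, reference_line):
    """Fraction of the reference line's tokens that appear in the code line."""
    reference_tokens = set(reference_line.split())
    if not reference_tokens:
        return 0.0
    return len(reference_tokens & set(code_line.split())) / len(reference_tokens)

def classify(code_line, repository, threshold=0.8):
    """Return the label of the most similar repository line, or "unknown"."""
    best_label, best_score = "unknown", 0.0
    for reference_line, label in repository.items():
        score = similarity(code_line, reference_line)
        if score >= threshold and score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical repository of previously labeled code lines.
repository = {"%run": "legitimate", "!curl": "illegitimate"}
label = classify("%run -t mynb.ipynb", repository)
```

- A code line that falls below the threshold for every stored line remains unclassified and may be passed to further analysis.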
- Data request analysis module 110 may be configured to cluster SCEs 30 B into clusters 113 ′ that include, or pertain to, similar actions.
- Such actions may include, for example importing of modules, “pandas” data frame actions, special notebook actions implemented in notebook kernels (commonly referred to as “magic” actions), data retrieval actions and the like.
- Data request analysis module 110 may calculate a frequency metric value, representing a frequency of appearance of terms or keywords in code lines of SCE 30 B. Additionally, or alternatively, data request analysis module 110 may calculate a similarity metric value 117 , representing similarity between two or more of the terms or keywords of SCE 30 B.
- ML model 113 may subsequently cluster the terms or keywords based on the calculated frequency metric value and/or the similarity metric value 117 to produce a cluster model (denoted 113 ′), that may be or may include a plurality of clusters (also denoted 113 ′). Each cluster 113 ′ of cluster model 113 ′ may represent terms that are close in meaning and/or utilization.
- Data request analysis module 110 may calculate or generate sub-clusters (denoted 113 ′′), using a similarity algorithm, to provide finer granularity of similar actions.
- a cluster 113 ′ of import actions may include a first sub-cluster 113 ′′ that pertains to importing “pandas” elements, and a second sub-cluster 113 ′′ that pertains to importing “JSON” elements.
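- As an illustrative sketch, sub-clusters of import actions may be formed by grouping cells according to the top-level module that they import. The function name `import_subclusters` is hypothetical, and the use of Python's `ast` module is an assumption; cells containing notebook “magic” lines would need to be handled separately, since they are not valid Python syntax.

```python
import ast
from collections import defaultdict

def import_subclusters(cells):
    """Group cell indices by the top-level module each cell imports."""
    clusters = defaultdict(list)
    for index, source in enumerate(cells):
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                modules = [node.module]
            else:
                continue
            for module in modules:
                clusters[module.split(".")[0]].append(index)
    return dict(clusters)

cells = ["import pandas as pd", "import json\nfrom pandas import DataFrame", "x = 1"]
subclusters = import_subclusters(cells)
```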
- Data request analysis module 110 may calculate a ranking score 116 , representing legitimacy of code lines or SCEs 30 B included in that cluster 113 ′ or sub-cluster 113 ′′.
- Data request analysis module 110 may calculate the ranking score 116 as a function of a frequency (e.g., the number of times) that a relevant SCE 30 B (which may also be referred to herein as a “code line”, “term”, or “cell”) appears in a repository of client 200 and/or in third-party, online repositories.
- the ranking score 116 may be calculated as a function of the time (e.g., days) that has elapsed since a previous appearance of an SCE 30 B of the same cluster 113 ′ in the scanned notebooks 30 A.
- A ranking score 116 (denoted RankScore(i)) of a specific SCE 30 B (denoted i) may be calculated according to equation Eq. 1, below:
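- $$\mathrm{RankScore}(i) = \frac{\mathrm{TotalTermFreq}(i) \cdot \mathrm{ActionTime}(i)}{\max_{\text{all clusters}}\left[\mathrm{AllTerms}(\mathrm{cluster})\right] \cdot \max_{0 \le i \le R}\left[\mathrm{ActionTime}(i)\right]} \cdot \left[\frac{1}{\log\left(\mathrm{dDays}(i) + 1\right)}\right] \qquad \text{(Eq. 1)}$$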
- TotalTermFreq(i) may be the number of times that a specific term or sub-term (e.g., “import pandas”) was found (e.g., counted) in the customer's historical notebooks 30 A or code scripts;
- ActionTime(i) may be the time (e.g., in seconds) that the action took to perform (e.g., for code actions);
- $\max_{\text{all clusters}}[\mathrm{AllTerms}(\mathrm{cluster})]$ may be the maximum number of times that a term or a sub-term was found (e.g., counted) for this specific notebook 30 A in historical runs (e.g., all historically recorded executions);
- $\max_{0 \le i \le R}[\mathrm{ActionTime}(i)]$ may be the maximum amount of time (e.g., in seconds) that a term or sub-term took for this specific notebook 30 A in historical runs (e.g., all historically recorded executions);
- log (dDays(i)+1) is a logarithm on the number of (days +1), that have
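- A minimal numeric sketch of Eq. 1 is given below, reading the equation as the product of a normalized frequency-and-duration term and a recency term. The function and parameter names are hypothetical, and the use of the natural logarithm is an assumption, since the base of the logarithm is not stated. The sketch rejects non-positive day counts, for which the logarithm would be zero or negative.

```python
import math

def rank_score(total_term_freq, action_time, max_term_count, max_action_time, d_days):
    """Evaluate Eq. 1 for one term; d_days must be positive so that log(d_days + 1) > 0."""
    if d_days <= 0:
        raise ValueError("d_days must be positive")
    recency = 1.0 / math.log(d_days + 1)  # natural logarithm assumed
    normalized = (total_term_freq * action_time) / (max_term_count * max_action_time)
    return normalized * recency

# Example: a term seen 10 times taking 2 s, against maxima of 20 times and 4 s.
score = rank_score(10, 2.0, 20, 4.0, math.e - 1)
```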
- data request analysis module 110 may produce one or more run-time rules 100 C and/or one or more scripting rules 100 D, based on the clustering 113 ′.
- Run-time rules 100 C and/or scripting rules 100 D may be stored in repository 80 , and may be used for continuous preemptive, and run-time analysis of ICDs 30 A and SCEs 30 B.
- ML model 113 may classify or define a cluster 113 ′ of actions as representing illegitimate actions (e.g., actions that include transferring of data to an untrusted location).
- Data request analysis module 110 may create a scripting rule 100 D, and associate scripting rule 100 D with the relevant cluster 113 ′.
- Scripting rule 100 D may be regarded as a rule in a sense that it may associate an identified condition (e.g., appearance of an illegitimate data transferal action), as defined by the relevant cluster 113 ′, with a mitigating action.
- mitigating action may include disallowing of execution of the illegitimate actions that are included in the relevant cluster 113 ′.
- data request analysis module 110 may be, or may include a monitoring module, adapted to monitor execution of SCEs 30 B of the one or more ICDs 30 A by the at least one cloud-based platform 50 .
- data request analysis module 110 may include a rule engine module 115 , adapted to identify, during run-time or subsequent to execution, a problem in execution of SCE 30 B, according to a set of predefined runtime rules 100 C, based on the monitoring.
- data request analysis module 110 may perform run-time monitoring of logs generated by the at least one cloud-based platform. In another example, data request analysis module 110 may perform run-time monitoring of an order of execution of the scripting code elements. In another example, data request analysis module 110 may perform run-time monitoring of output of the executed scripting code elements. In another example, data request analysis module 110 may monitor execution statistics such as duration of execution of at least one scripting code element of the one or more interactive computing documents.
- data request analysis module 110 may perform a comparison between output statistics of a current execution of an interactive computing document and output statistics of a previous execution of the interactive computing document.
- Data request analysis module 110 may be configured to identify at least one difference between an output of a specific SCE 30 B at a first execution and an output of the specific SCE 30 B (e.g., the same SCE 30 B, or another SCE 30 B containing similar code lines) at a second execution.
- data request analysis module 110 may be configured to identify at least one difference between a first metadata element (e.g., duration of execution, memory footprint, power consumption, etc.), associated with a specific scripting code element at a first execution, and a second, corresponding metadata element, associated with the specific scripting code element at a second execution.
- Data request analysis module 110 may monitor run-time statistics and/or output of executed SCEs (cells) 30 B.
- Data request analysis module 110 may record the processing time of each cell, such as code cells, markup (HTML) cells, and/or raw cells, as commonly referred to in the art.
- Rule engine module 115 of data request analysis module 110 may produce one or more run-time rules 100 C pertaining to specific features or characteristics of the execution statistics and/or output of the execution. Pertaining to the same example, rule engine module 115 may produce a rule 100 C that associates a detected abnormal behavior (e.g., when execution exceeds a predetermined threshold period) with a mitigation action (e.g., revocation of data).
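- A run-time rule of this kind may be sketched as a condition over execution statistics paired with a mitigation action. The class name `RuntimeRule`, the statistic keys, and the threshold values below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RuntimeRule:
    name: str
    condition: Callable[[dict], bool]  # evaluated against execution statistics
    mitigation: str                    # name of the associated mitigation action

def evaluate(rules, stats):
    """Return (rule name, mitigation) for every rule whose condition is met."""
    return [(rule.name, rule.mitigation) for rule in rules if rule.condition(stats)]

rules = [
    RuntimeRule("slow_cell", lambda s: s["duration_s"] > 60.0, "revoke_data"),
    RuntimeRule("large_output", lambda s: s["output_bytes"] > 10_000_000, "notify"),
]
triggered = evaluate(rules, {"duration_s": 75.0, "output_bytes": 100})
```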
- Data request analysis module 110 may analyze the one or more extracted action features to identify suspicious or abnormal actions, such as an abnormal attempt to access data stored on computing environment 50. In a condition that such an abnormal or suspicious data action request 200 B is identified, analysis module 110 may block the relevant data action request 200 B (e.g., from accessing, reading, writing, deleting, etc. relevant data on computing environment 50 ).
- Data request analysis module 110 may produce a message or log that includes an indication of the identified abnormal data action (e.g., abnormal data access). Data request analysis module 110 may transmit this message to be stored on database 80. Additionally, or alternatively, data request analysis module 110 may communicate the message or log of abnormal behavior to an alerts module 160.
- Alerts module 160 may in turn send a notification (e.g., an email message) to relevant computing systems, e.g., via notification module 240 of the relevant client 200. Additionally, or alternatively, alerts module 160 may communicate the notification of abnormal data action to data revocation module 230 of the relevant client 200 , to revoke or delete the relevant data from storage associated with the relevant development environment 30.
- rule engine module 115 may collaborate with alerts module 160 to communicate a notification message 100 F to a relevant client module 200 .
- Message 100 F may include a representation of the identified execution problem (e.g., rule 100 C that has been triggered).
- Client 200 may subsequently display notification message 100 F on a user interface (UI) of the user development platform 30 .
- data request analysis module 110 may include a scanning service 114 .
- Scanning service 114 may interact with, and scan real-time, incoming SCEs 30 B (e.g., data requests) from clients 200 , as well as historical requests, and logs of data access footprints and data actions stored on database 80 .
- rule engine module 115 may initiate or perform at least one mitigation action 100 G (e.g., other than producing a notification message 100 F).
- rule engine module 115 may communicate with a relevant client 200 and/or cloud-based platform 50 , to block a user of the specific client module from performing a data action (e.g., accessing data stored on the at least one cloud-based platform 50 ).
- rule engine module 115 may collaborate with data management module 140 , and communicate with one or more organizational servers or databases 80 , to change access permissions (e.g., permissions associated with a role of a user, permissions associated with a specific client module, etc.) to a specific cloud-based platform 50 .
- rule engine module 115 may collaborate with tracking module 250 of a relevant client 200 , to delete data footprints of the at least one executed scripting code element, as elaborated herein.
- rule engine module 115 may communicate with a relevant client 200 to disable a specific user of a specific client module 200 from executing SCEs 30 B in their user development platform 30 . Additional mitigation actions 100 G are also possible.
- Data request analysis module 110 of proxy server 100 may be adapted to monitor execution of the SCEs 30 B of the one or more ICDs 30 A by the at least one cloud-based platform 50. Additionally, or alternatively, data request analysis module 110 may be configured to produce a log data element, representing said monitored execution of the scripting code elements. Data request analysis module 110 may communicate the log data element to a relevant client module 200 (e.g., as a notification message 100 F). Data request analysis module 110 may communicate the log data element in real time or near real time, so as to preserve the interactive quality of ICD 30 A during the development process. The relevant client module 200 may, in turn, display the log data element (e.g., notification message 100 F) on a UI of the user development platform 30.
- proxy server 100 of the present invention may allow real-time, cell-level logging and monitoring of execution of actions on a cloud-based platform 50 , as they are incrementally called in a development platform 30 on the client 200 side.
- an ICD 30 A (e.g., a notebook) may be requested by a user (e.g., by development environment 30 ) to be executed or scheduled for execution on distributed computing environment (e.g., cloud platform) 50 .
- Client 200 may subsequently upload or transmit ICD 30 A to proxy server 100 , either entirely (e.g., including all SCEs 30 B of ICD 30 A), or partially (e.g., including a subset of SCEs 30 B of ICD 30 A), according to a predefined configuration.
- scanning service 114 may be configured to scan the received SCEs 30 B (e.g., code lines), according to the set of predetermined scripting rules 100 D as elaborated herein (e.g., in relation to FIG. 3 ).
- scanning service 114 may be configured to scan a subset of the received SCEs 30 B (e.g., code lines), according to a predetermined configuration.
- The one or more ICDs 30 A may include, or be associated with, document metadata, such as size (e.g., in bytes) of the ICD 30 A, a type of cells (SCEs 30 B) in the ICD 30 A (e.g., code cells, markup cells, raw cells, as commonly referred to in the art), an identification of an owner of the ICD 30 A, an identification and/or associated permissions of a user who is currently requesting execution of the ICD 30 A, etc.
- Scanning service 114 of proxy server 100 may scan the document metadata of ICD 30 A according to a set of predetermined scripting rules 100 D, which may be referred to herein as metadata rules 100 D′ in this context.
- Such scripting rules 100 D may include limitations on metadata of the scanned ICD 30 A, such as a limitation on size of the executed ICD 30 A, a limitation on permission to execute ICD 30 A, a limitation on execution of specific actions of code lines and/or types of SCEs 30 B, and the like. Other limitations may also be possible.
- Proxy server 100 may then encapsulate the one or more ICDs 30 A and/or the workflow data element 210 A in one or more data containers 120 A, based on the scan of document metadata, e.g., on condition that the ICD 30 A metadata complies with the limitations of scripting rules 100 D (e.g., metadata rules 100 D′).
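- The metadata scan may be sketched as a function that checks document metadata against a set of limitations and returns the violated ones. The metadata keys, the default limits, and the function name `check_metadata` are hypothetical and do not represent an actual metadata format.

```python
def check_metadata(metadata, max_size_bytes=5_000_000,
                   allowed_cell_types=("code", "markup", "raw"),
                   permitted_users=()):
    """Return a list of violated limitations; an empty list means compliance."""
    violations = []
    if metadata["size_bytes"] > max_size_bytes:
        violations.append("size limit exceeded")
    if not set(metadata["cell_types"]) <= set(allowed_cell_types):
        violations.append("disallowed cell type")
    if metadata["requesting_user"] not in permitted_users:
        violations.append("user lacks execute permission")
    return violations

metadata = {"size_bytes": 120_000, "cell_types": ["code", "markup"],
            "owner": "alice", "requesting_user": "bob"}
violations = check_metadata(metadata, permitted_users=("alice",))
```

- Under this sketch, encapsulation in a data container would proceed only when the returned list is empty.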
- scanning service 114 may be configured to monitor execution of one or more SCEs 30 B in the received ICD 30 A, to identify SCE-level (e.g., cell-level) exceptions in the execution of the received ICD (e.g., notebook) 30 A, and/or malicious or suspicious actions, according to runtime rules 100 C, as elaborated herein (e.g., in relation to FIG. 3 ).
- scanning service 114 may communicate with code tracking module 250 during run time, to notify a user and/or perform mitigation actions, in response to the identified exceptions or suspicious actions.
- one or more (e.g., each) problematic code line or action found by rules 100 D and/or 100 C may be associated with a mitigation action.
- the mitigation action may be invoked by alert message 100 F, and may include, for example sending an email to an administrator user, returning a report to the user interactive development environment 30 using client 200 , deleting specific data footprints, disabling execution of code lines in the interactive development environment 30 (either permanently or for a predefined period of time), and the like.
- proxy server 100 may include a data management module 140 .
- Data management module 140 may be configured to manage permissions, restrictions and/or mitigation actions according to specific users and/or roles.
- data management module 140 may be configured to set, or change a mitigation action according to a user's identity.
- data management module 140 may enable proxy server 100 , subject to relevant permissions, to disallow execution of a suspected SCE 30 B by a first user, and allow execution of the suspected SCE 30 B by a second user, having elevated permissions.
- data management module 140 may allow dynamic adjustment of mitigation actions.
- Data management module 140 may enable proxy server 100 to run a specific ICD 30 A (e.g., a notebook) if fewer than a predefined number of problematic actions (or actions with low severity) have been identified in the relevant ICD 30 A.
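- The user-dependent and count-dependent execution decisions described above may be combined, for example, as sketched below. The severity scale, the role names, and the exact way the conditions are combined are assumptions made for illustration.

```python
SEVERITY = {"low": 1, "medium": 2, "high": 3}  # hypothetical severity scale

def may_execute(findings, user_role, max_findings=3, elevated_roles=("admin",)):
    """Decide whether a notebook may run, given scan findings and the user's role."""
    if user_role in elevated_roles:
        return True  # elevated permissions override the scan outcome
    if any(SEVERITY[f["severity"]] > SEVERITY["low"] for f in findings):
        return False
    return len(findings) < max_findings

findings = [{"action": "import os", "severity": "low"},
            {"action": "!curl http://example.com", "severity": "high"}]
analyst_allowed = may_execute(findings, "analyst")
admin_allowed = may_execute(findings, "admin")
```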
- A user of development environment 30 may utilize client 200 APIs to request manual analysis/scan of an ICD 30 A (e.g., notebook), using these modules to get a personal report.
- scanning service 114 may be configured to add web features that are extracted by data enhancement module 150 , to determine or identify a data action (e.g., a data access) request 200 B that is suspected to be a security breach. Scanning service 114 may implement anomaly detection and user behavior analysis to determine a risk and send messages on such cases to data access module 120 , as elaborated herein.
- Scanning service 114 may be configured to check data validation in case of data actions such as a data write access or a data delete access, to avoid malicious or erroneous changes in a dataset stored on environment 50.
- scanning service 114 may determine that a delete data access is valid by comparing, or analyzing extracted features (e.g., timestamps, dataset names, execution-code, etc.) of a current data action request 200 B in view of extracted features of historical delete data access sequences. Scanning service 114 may compare such extracted features of a predefined number of actions before and/or after deletion, corresponding to current and historical delete data action requests 200 B, to identify an anomalous deletion attempt.
- Scanning service 114 may determine validity of a write or replace action by comparing extracted features of the new and old datasets, such as column names and the value distribution and range per column, to identify a suspicious or anomalous write data access attempt, and may determine whether to block the action and/or notify users.
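- A simple form of the write validation described above may be sketched by comparing column names and per-column value ranges of the old and new datasets. The function names and the tolerance parameter are hypothetical, and only numeric columns are considered in this sketch.

```python
def column_profile(rows):
    """Map each column name to its (min, max) over the given rows."""
    profile = {}
    for row in rows:
        for name, value in row.items():
            low, high = profile.get(name, (value, value))
            profile[name] = (min(low, value), max(high, value))
    return profile

def validate_write(old_rows, new_rows, tolerance=0.5):
    """Return a list of detected problems; an empty list means the write looks valid."""
    old, new = column_profile(old_rows), column_profile(new_rows)
    problems = []
    if set(old) != set(new):
        problems.append("column names differ")
    for name in sorted(set(old) & set(new)):
        (old_low, old_high), (new_low, new_high) = old[name], new[name]
        margin = tolerance * (old_high - old_low)  # allowed drift of the range
        if new_low < old_low - margin or new_high > old_high + margin:
            problems.append(f"range of column {name!r} changed")
    return problems

old = [{"age": 20, "score": 0.5}, {"age": 60, "score": 0.9}]
ok = validate_write(old, [{"age": 25, "score": 0.6}, {"age": 70, "score": 0.8}])
bad = validate_write(old, [{"age": 25}, {"age": 900}])
```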
- At least one proxy server 100 may include a data access module 120 , adapted to receive one or more ICDs 30 A and/or at least one workflow data element 210 A from a client 200 , and encapsulate the workflow data element and the one or more ICDs 30 A in one or more data containers 120 A, based on the preemptive scan (e.g., conditioned that the preemptive scan was successful).
- data access module 120 may encapsulate the relevant one or more notebooks 30 A in a data container 120 A.
- data access module 120 may refrain from encapsulating the relevant one or more notebooks 30 A in a data container 120 A, and thus block any data action request 200 B included in the one or more notebooks 30 A.
- data request analysis module 110 may block execution of one or more ICDs 30 A (e.g., notebooks) based on outcome of the preemptive scan, by disallowing transmission of ICDs 30 A (e.g., whether encapsulated in data container 120 A or not) to cloud computing platform 510 .
- the term “container” may be used herein to refer to a standard unit or package of software that may include computer code (e.g., source code included in cells 30 B) and relevant code dependencies required for execution of that computer code.
- Such dependencies may include, for example references to specific resources, (e.g., specific elements of distributed computing environment 50 ), databases, (e.g., database 80 ), configuration of distributed processing systems (e.g., kernels, and memory of distributed processing systems 520 ), configuration of cloud services (e.g., allocated storage in storage services 531 ), and the like.
- Container or docker 120 A may allow an application to be quickly and reliably executed when moved from one computing environment to another (e.g., from development environment 30 to distributed computing environment 50 ).
- data access module 120 may receive (e.g., from a user of development environment 30 , via client 200 ) one or more configurations or adjustments to container 120 A (e.g., selection of processing units, selection of memory, etc.), and may adapt container 120 A accordingly.
- execution parameters may be used in computing languages to allow a user to provide arbitrary configuration information and/or setting variable values for executing computer code.
- data access module 120 may receive from the at least one client module, at least one execution parameter value, associated with a specific ICD 30 A, and may inject the received execution parameter value as metadata into the relevant ICD 30 A. Data access module 120 may then encapsulate the document metadata (e.g., including the injected execution parameter value) into the one or more data containers 120 A.
- Data access module 120 may subsequently transmit, or propagate (e.g., via the Internet) data containers 120 A to the at least one cloud-based platform 50 , which may, in turn execute at least one SCE 30 B of the one or more ICDs according to the document metadata (e.g., according to the injected execution parameter value).
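- Injection of an execution parameter value as document metadata may be sketched as follows, treating the ICD as a JSON-like dictionary with a top-level metadata section. The metadata key `execution_parameters` and the function name are hypothetical and are not part of any notebook format specification.

```python
import copy

def inject_parameters(notebook, parameters):
    """Return a copy of the notebook with parameters stored in its metadata."""
    updated = copy.deepcopy(notebook)  # leave the client's original document untouched
    section = updated.setdefault("metadata", {}).setdefault("execution_parameters", {})
    section.update(parameters)
    return updated

notebook = {"cells": [{"cell_type": "code", "source": "print(threshold)"}],
            "metadata": {}}
packaged = inject_parameters(notebook, {"threshold": 0.8})
```

- The executing platform may then read the injected values from the document metadata before running the relevant cells.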
- data access module 120 may be configured to transmit the one or more data containers 120 A to at least one cloud-based platform (e.g., distributed computing environment 50 ), to execute the one or more ICDs 30 A.
- Data access module 120 may be configured to transmit the one or more data containers and/or the workflow data element to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the execution scheme.
- At least one cloud computing platform 510 of distributed computing environment 50 may be configured to execute, or run the one or more ICDs 30 A encapsulated in data containers 120 A according to the execution scheme defined by the workflow data element 210 A as elaborated herein (e.g., in relation to FIG. 3 ).
- proxy server may be configured to encapsulate workflow data element 210 A with the one or more ICDs 30 A in the one or more data containers 120 A. Additionally, or alternatively, proxy server may be configured to transmit workflow data element 210 A separately from the one or more data containers 120 A of ICDs 30 A (e.g., as separate data files) to cloud computing platform 510 of distributed computing environment 50 .
- data access module 120 may be configured to encapsulate workflow data element 210 A separately from the one or more ICDs 30 A.
- data access module 120 may encapsulate workflow data element 210 A in a first data container 120 A, and then spawn or generate one or more second data containers 120 A from the first data container 120 A.
- the second data containers 120 A may be characterized in that (a) each second data container 120 A may inherit configurations and/or definitions of the first data container 120 A, and (b) one or more (e.g., each) of the second data containers 120 A may encapsulate one or more of the received ICDs 30 A, based on the workflow data element.
- the spawning of the one or more second data containers 120 A may manifest the execution scheme (e.g., order of execution, dependency of execution, parallelism of execution) of the relevant ICDs 30 A.
- Data access module 120 may then transmit the one or more second data containers 120 A to the at least one cloud-based platform 50 according to the execution scheme, as defined by the workflow data element 210 A. Additionally, or alternatively, the at least one cloud-based platform 50 may be configured to execute the one or more relevant ICDs 30 A, encapsulated in their respective containers 120 A, according to the execution scheme, as defined by the workflow data element 210 A.
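- The spawning of second data containers from a first data container may be sketched as follows. The `Container` class, its fields, and the `order` key of the workflow data element are hypothetical names introduced for illustration only; actual containers would carry considerably more state.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Container:
    config: Dict[str, str]
    payload: Dict[str, object] = field(default_factory=dict)


def spawn_containers(first: Container, workflow: dict,
                     notebooks: Dict[str, str]) -> List[Container]:
    """Create one child container per notebook, in workflow order."""
    children = []
    for name in workflow["order"]:
        # Each child inherits (a copy of) the first container's configuration.
        child = Container(config=dict(first.config),
                          payload={"notebook": name, "source": notebooks[name]})
        children.append(child)
    return children


workflow = {"order": ["prepare", "train"]}  # hypothetical workflow data element
first = Container(config={"memory": "8g", "cpus": "4"},
                  payload={"workflow": workflow})
notebooks = {"train": "model.fit(X, y)", "prepare": "X, y = load()"}
children = spawn_containers(first, workflow, notebooks)
```

- Because the children are produced in the order given by the workflow data element, the resulting list itself reflects the order of execution.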
- At least one proxy server 100 may include a data enhancement module 150 .
- data enhancement module 150 may be, or may include a data mining engine, adapted to extract data from various data sources and databases, including for example online data sources, local and/or cloud based databases, and/or rules and terms that are added manually (e.g., via input 7 of FIG. 2 ).
- data enhancement module 150 may mine for data such as machine details per cloud provider, platform or solution and their costs, as well as machine-learning and data-processing code snippets from code repositories and similar sources (e.g., GitHub, blogs, StackOverflow.com). Data enhancement module 150 may provide the mined data to a data access optimization module 130 to predict the best offer to process, retrieve or write data, as elaborated herein.
- data enhancement module 150 may also extract or mine relevant data for the data request analysis module 110 for the prediction of a security risk or mistake.
- data enhancement module 150 may, as part of the data mining process, query security engines, tools, APIs and knowledge bases (e.g., exploitation/vulnerability databases, https://attack.mitre.org/, the DarkReading website, open source lists of security detection rules), as well as code repositories and blogs on specific data breach cases (e.g., in https://stackoverflow.com/, https://github.com/) and cloud platform documentation of data retrieval, processing and writing, to obtain a reference for valid code snippets.
- Data enhancement module 150 may then aggregate the mined data to a relevant list of bad/good terms for recommendations for the data request analysis module 110 to determine/predict a security/data-breach/data-exfiltration issue.
- Data enhancement module 150 may also build a set of recommendations per historical detected potential data breach/exfiltration using the extracted data and log of actions.
- At least one proxy server 100 may include a data management module 140 , configured to enable addition, updating and/or deletion of entities such as users, roles, policies, security-rules, recommendations, data-services, clusters, schedulers, code-scripts (on every cloud platform) and datasets and dataset actions, while managing connection to customer users-databases such as Lightweight Directory Access Protocol (LDAP), Active Directory (AD), Gsuite and the like.
- Data management module 140 may include the configuration required to connect to all users, cloud providers, data services, and data sets, in order to send commands and retrieve, delete or write data from or to environment 50 .
- data management module 140 may enable a user (e.g., via input 7 of FIG. 2 ) to define an automatic processing machine/machines-cluster/job/notebook/scripts scheduler, that will interact with the data access optimization module 130 , as elaborated herein.
- Retrieval of data may be performed by analyzing the command(s) and comparing them to all the historical commands that were sent during a cluster connection session, to determine whether the request needs to be sent again or whether the last result, already cached on cloud storage, can simply be retrieved. The timestamp of the cached result may be verified against user parameters or a force-retrieve parameter.
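- The cached retrieval described above, including the timestamp check and the force-retrieve parameter, may be sketched as follows. The `ResultCache` class and its method names are hypothetical, and an in-memory dictionary stands in for cloud storage.

```python
import time


class ResultCache:
    """Caches command results and honors a maximum age and a force flag."""

    def __init__(self, max_age_seconds: float):
        self.max_age_seconds = max_age_seconds
        self._entries = {}  # command -> (result, stored_at)

    def store(self, command, result, now=None):
        self._entries[command] = (result, time.time() if now is None else now)

    def lookup(self, command, force_retrieve=False, now=None):
        """Return the cached result, or None when the request must be re-sent."""
        if force_retrieve or command not in self._entries:
            return None
        result, stored_at = self._entries[command]
        current = time.time() if now is None else now
        return result if current - stored_at <= self.max_age_seconds else None


cache = ResultCache(max_age_seconds=60)
cache.store("SELECT * FROM sales", "rows", now=100.0)
fresh = cache.lookup("SELECT * FROM sales", now=130.0)
stale = cache.lookup("SELECT * FROM sales", now=200.0)
```

- In this sketch a return value of None signals that the command should be sent to the cluster again, either because no sufficiently recent result exists or because the user forced retrieval.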
- data access optimization module 130 may be configured to further analyze the data request or execution request that was analyzed by data request analysis module 110 .
- data access optimization module 130 may split a received data action request 200 B to terms such as actions, codes, file-parameters and the like.
- Data access optimization module 130 may analyze these terms in comparison to, or in view of similar historical terms (e.g., terms included in previous data action requests 200 B).
- Data access optimization module 130 may, for example, calculate a similarity (e.g., similarity metric value 117 ) between terms using Term Frequency-Inverse Document Frequency (TF-IDF), cosine similarity, ML-based models, and the like.
- data access optimization module 130 may determine whether to retrieve the last action's results (e.g., operate as a cache) or to re-process it.
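- A comparison of a data action request to historical requests by TF-IDF weighting and cosine similarity may be sketched as follows. The tokenization by whitespace, the smoothed IDF variant, the 0.9 threshold and all function names are assumptions made for this illustration, and are not taken from the embodiments described herein.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (term -> weight) using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_cached_match(request, history, threshold=0.9):
    """Return (matching historical request or None, best similarity score)."""
    vectors = tfidf_vectors(history + [request])
    query = vectors[-1]
    scored = [(cosine(query, vec), past) for vec, past in zip(vectors, history)]
    score, match = max(scored, default=(0.0, None))
    return (match, score) if score >= threshold else (None, score)


history = ["select * from sales where year = 2021", "delete from logs"]
hit, hit_score = best_cached_match("select * from sales where year = 2021", history)
miss, miss_score = best_cached_match("read parquet file from bucket", history)
```

- When a match is returned, the last action's results may be reused; otherwise the action may be re-processed.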
- access optimization module 130 may calculate (e.g., via a regression model) the amount of data requested in case of data retrieval (or simply the data size in case of a write or delete action). Using a prediction model calculated beforehand (e.g., one that runs at every predetermined interval), access optimization module 130 may determine the cheapest and/or fastest code or API, and platform or solution, to run the action, and may choose the best options automatically according to user definitions in data management module 140 (e.g., automatic or manual changes to the automatic process). In doing so, access optimization module 130 may decide to create new data cluster(s), alter the machine types or number of machines of a cluster (possibly on various cloud providers) to process the data, and/or terminate or halt currently active clusters or services on different cloud platforms or services, by calculating the optimal time for the cluster to work before termination and the size of the initial cluster on creation.
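- The selection of the cheapest or fastest option for a given amount of data may be sketched as follows. A simple linear estimate stands in here for the regression and prediction models mentioned above; the `Option` class, the option names, and all throughput and price figures are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    gb_per_hour: float    # assumed processing throughput
    cost_per_hour: float  # assumed price


def estimate(option: Option, data_gb: float):
    """Return (hours, total cost) for processing data_gb with this option."""
    hours = data_gb / option.gb_per_hour
    return hours, hours * option.cost_per_hour


def choose(options, data_gb: float, prefer: str = "cheapest") -> Option:
    """Pick the option minimizing cost ("cheapest") or time (anything else)."""
    index = 1 if prefer == "cheapest" else 0
    return min(options, key=lambda option: estimate(option, data_gb)[index])


options = [
    Option("small-cluster", gb_per_hour=100.0, cost_per_hour=1.0),
    Option("large-cluster", gb_per_hour=1000.0, cost_per_hour=12.0),
]
cheapest = choose(options, data_gb=500.0, prefer="cheapest")
fastest = choose(options, data_gb=500.0, prefer="fastest")
```

- With these assumed figures, the small cluster is cheaper (5 hours at a total cost of 5.0) while the large cluster is faster (0.5 hours at a total cost of 6.0), which illustrates why the preference may be left to user definitions.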
- access optimization module 130 may calculate a computing machine type and/or a required number of computing machines by employing a machine learning model and statistical regressions. To this end, access optimization module 130 may extract relevant terms from the action's parameters (which may be a code snippet, a code file or a notebook file) and gather the relevant parameters logged from code run module 210 of client 200 (such as ‘CUDA’, ‘GPU’, ‘TensorFlow’, ‘Keras’ and ‘Memory failure’), as well as historical action results and processing times.
- FIG. 5 is a flow diagram, depicting a method of providing an interactive development platform in a distributed computing environment 50 , according to some embodiments of the invention.
- embodiments of the invention may provide a proxy server module (e.g., proxy server 100 of FIG. 4 ), adapted to interface at least one cloud-based platform (e.g., distributed computing environment 50 of FIG. 4 ), and one or more client modules (e.g., client(s) 200 of FIG. 4 ), operatively connected to proxy server 100 , where each client module 200 is associated with a respective user development platform (e.g., development environment 30 of FIG. 3 ) such as Amazon's Sagemaker.
- At least one client module 200 may receive, from the respective user development platform 30 , one or more ICDs 30 A.
- ICDs 30 A are also referred to herein as notebooks, such as Amazon Sagemaker notebook documents.
- the terms ICD and notebook may be used herein interchangeably.
- each ICD or notebook 30 A may include, or represent one or more SCEs 30 B, such as notebook cells, as commonly referred to in the art.
- SCE, cells and code lines may be used herein interchangeably.
- the at least one client module 200 may produce a workflow data element (e.g., element 210 A of FIG. 4 ), that may define a scheme for executing the one or more ICDs, as elaborated herein (e.g., in relation to FIG. 4 ).
- the at least one client module 200 may transmit (e.g., via the Internet) the one or more ICDs 30 A and/or the one or more included SCEs 30 B and/or the workflow data element 210 A to proxy server 100 .
- proxy server 100 may scan the one or more SCEs 30 B according to a set of rules.
- proxy server 100 may perform a preemptive scan of the one or more SCEs 30 B according to a set of predetermined scripting rules 100 D (e.g., element 100 D of FIG. 3 ), as elaborated herein (e.g., in relation to FIG. 4 ).
- proxy server 100 may encapsulate the one or more interactive computing documents in one or more data containers (e.g., element 120 A of FIG. 4 ), based on the scan (e.g., provided that the scan was successful) as elaborated herein (e.g., in relation to FIG. 4 ). Additionally, or alternatively, proxy server 100 may encapsulate workflow data element 210 A with the one or more interactive computing documents in the one or more data containers 120 A.
- proxy server 100 may subsequently transmit the one or more data containers to the at least one cloud-based platform 50 .
- Cloud-based platform 50 may be configured to execute at least one SCE 30 B of the one or more ICDs 30 A encapsulated in the one or more containers 120 A.
- proxy server 100 may transmit the one or more data containers to the at least one cloud-based platform, to execute the at least one SCE 30 B of the one or more notebooks, according to the execution scheme, as defined by the workflow data element 210 A.
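- The preemptive scan of cells, and the encapsulation that depends on its outcome, may be sketched as follows. The two regular-expression rules, their names, and the function names are hypothetical examples; the actual content of scripting rules 100 D is not specified here, and a serialized JSON payload stands in for a data container.

```python
import json
import re

# Hypothetical scripting rules: (rule name, pattern that triggers it).
SCRIPTING_RULES = [
    ("shell_escape", re.compile(r"^\s*!", re.MULTILINE)),
    ("raw_socket", re.compile(r"\bimport\s+socket\b")),
]


def scan_cells(cells):
    """Return a list of (cell index, rule name) for every triggered rule."""
    violations = []
    for index, source in enumerate(cells):
        for name, pattern in SCRIPTING_RULES:
            if pattern.search(source):
                violations.append((index, name))
    return violations


def encapsulate_if_clean(cells):
    """Encapsulate the cells only when the scan finds no violations."""
    violations = scan_cells(cells)
    if violations:
        return None, violations
    return json.dumps({"cells": cells}).encode("utf-8"), []


clean_cells = ["import pandas as pd", "df = pd.DataFrame()"]
risky_cells = ["import pandas as pd", "!curl http://example.com"]
container, no_violations = encapsulate_if_clean(clean_cells)
blocked, violations = encapsulate_if_clean(risky_cells)
```

- In this sketch a notebook containing any violating cell is not encapsulated at all; an embodiment may equally report the violations to the client and await correction.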
- embodiments of the invention may include a practical application for improving the technology of a computing system.
- proxy server 100 and the one or more client computing devices 200 may be configured to streamline a development process that currently requires iterative collaboration between a plurality of disciplines and stakeholders such as research engineers, data scientists, data engineers and automation engineers.
- embodiments of the invention may provide this streamlined workflow in a manner that maintains data security, and requires optimal computation resources.
- embodiments of the invention may enable following datasets that are transferred to a development environment, and revoking or limiting their use, while understanding source code and variables.
- embodiments of the invention may track transfer of data requests 200 B between client 200 and proxy server 100 locally/remotely and learn malicious data behaviors, that may use the retrieved data and transfer/save/write/replace/process it, in order to delete data footprints in case of a data security incident (either by a malicious user or by mistake). Additionally, embodiments of the invention may employ rules, statistics and ML models to alert on a problematic or suspicious data-related behavior and block suspicious data-related actions.
- embodiments of the invention may provide a multiple-notebook development environment that may facilitate integration of notebooks in a workflow, to follow and understand data access behavior and usage, while enabling scheduled tasks to be monitored and optimizing execution of commands, jobs and scripts relevant to data and machine learning.
- embodiments of the invention may optimize data action requests 200 B (e.g., data access requests) by analyzing the requests 200 B, determining the amount of data requested, activating the optimal request for the optimal service, and choosing the optimal machine type(s) and number of machines, in terms of cost and speed, in order to get, write or delete the data, or run the data-related source code, in the fastest and/or cheapest combination. This may require automatically creating a new big data cluster, or terminating other clusters on different cloud platforms, so as to get the data in a scalable manner, by manually or automatically defining the number of data nodes (size of cluster) according to the request, and the optimal active time for each cluster on each cloud platform for optimal processing.
- embodiments of the invention may provide recommendations on how to improve performance or harden data security, by extracting relevant web-based knowledge bases and code snippets after identifying potential data exfiltration, while accessing all data sources, using various data services and data-related actions, and using the development environment to investigate data.
Abstract
A system and method of providing an interactive development environment may include providing a proxy server module, adapted to interface at least one cloud-based platform and one or more client modules, operatively connected to the proxy server, where each client module may be associated with a respective user development platform. At least one client module may receive, from the respective user development platform, one or more interactive computing documents, commonly referred to as notebooks, each representing one or more scripting code elements, commonly referred to as cells. The proxy server may scan the one or more cells, according to a set of predetermined scripting rules, and encapsulate the one or more notebooks in one or more data containers, based on the scan. The proxy server may subsequently transmit the one or more data containers to a cloud-based platform, to execute at least one cell of the one or more notebooks.
Description
- This application is a continuation of U.S. Pat. No. 17/728,715, filed Apr. 25, 2022, entitled “SYSTEM AND METHOD OF PROVIDING AN INTERACTIVE DEVELOPMENT PLATFORM IN A DISTRIBUTED COMPUTING ENVIRONMENT”, which claims the benefit of priority of U.S. patent application Ser. No. 63/235,747, filed Aug. 22, 2021, entitled “A SYSTEM AND METHOD TO MANAGE, FOLLOW, SECURE AND OPTIMIZE DATA ACCESS”, both of which are hereby incorporated by reference in their entirety.
- The present invention relates to the growing field of data engineering. More precisely, the present invention relates to providing an interactive development platform in distributed computing environments.
- Reference is now made to FIG. 1, which is a schematic diagram, depicting a workflow of generating a machine learning (ML) model in a production environment.
- Modern data research is typically done interactively, by running scripting commands on large data sets that consume large amounts of memory in a development environment, and validating the result of these commands online. This process stands in contrast to “classic” code engineering, in which code is fully written in the development environment, and subsequently compiled and debugged to check the results.
- As known in the art, and shown in FIG. 1, a typical process for generating an ML model may include several disciplines. One such discipline may be that of a research engineer, or a data scientist, set to analyze or ingest big data. Another discipline may be that of a data engineer, who may be set to produce a model (e.g., a NN model), that may be applied on incoming data instances or examples. A third discipline may be that of an automation engineer, who may be set to automate training and/or inference of the ML model on large amounts of data, during training, testing and/or operational work modes.
- Automation engineers, data analysts, data engineers, data scientists and researchers work with various cloud computing platforms and access different kinds of data using multiple, and different Application Programming Interfaces (APIs).
- The process of transferring between data research and production requires utilization of a wide variety of code languages and repositories. For example, the different engineers and researchers typically need to employ a variety of development environments. Some of these development environments may be implemented on local computing environments (e.g., installed on computers within an organizational computer network), while others may be implemented using remote computing resources and platforms (e.g., Amazon SageMaker notebooks, and the like).
- Additionally, the process of producing an operational ML model requires intensive, iterative communication between the different disciplines. For example, as depicted in FIG. 1, a data scientist may acquire large amounts of data, to be iteratively validated and/or prepared or processed by a data engineer.
- In another example, the creation and/or training of an ML model (e.g., a NN architecture) typically requires multiple iterations and experimentations.
- In another example, a data engineer may need to communicate with an automation engineer in an arduous, iterative process, to automate the execution of the ML model. Automation engineers normally need to review the data scientists' code, (typically written in an interactive development environment such as Jupyter notebooks), split that code into scripts and then automate the process using a workflow engine that supports these scripts. This typically results in a high-maintenance, communication-extensive iterative process between data teams and engineering teams.
- In yet another example, production-level ML models require continuous maintenance and debugging, as additional data samples are introduced.
- It may be appreciated by a person skilled in the art that the currently available, multi-level and multi-platform process, as depicted in FIG. 1, results in a chasm between the original source code that may have been written by the data scientist, and the outcome pipeline code, that serves the automation engineer. For example, during maintenance or debugging of a production-level model, a data scientist may find it very difficult to associate their original source code (e.g., written as an Amazon SageMaker notebook) with the pipeline scripting code that was eventually executed.
- In order to reduce the cycle of data research en route to generating a production solution, currently-available systems for ML production typically run the interactive code on cloud platforms. Due to extensive consumption of computing resources (e.g., data, memory, processing power), there is a growing need to optimize the cost and run-time of these computing resources. This is particularly relevant in applications using ML models.
- Additionally, the multitude of data access requests, and data transferals among the different platforms and levels typically makes the task of maintaining production data security a daunting one. Access of production data, that may consist of sensitive or private information may be challenging, as it requires definition and enforcement of complex roles and policies over all cloud computing platforms.
- In other words, cloud computing solutions may be required to mitigate or prevent compromise of production data, and also provide information regarding unauthorized access to this data (e.g., who accessed what data, and for how long). This may require extraction of logs from multiple systems, over different cloud computing platforms.
- Additionally, in case of a suspected data breach or a security concern, there is an immediate need to perform mitigation actions that would include, for example, blocking access to breached data. However, since the data, once accessed, resides on other (e.g., local, or remote) computing environments, this mission may become very challenging. For example, an unwary data user (e.g., a data scientist or software engineer) or a malicious perpetrator may save copies of breached data, making it hard to follow the breach of data in an effort to avoid further data leakage and data exfiltration risks.
- As elaborated herein, embodiments of the invention may enable data scientists, data engineers and automation engineers to safely access production data, and run automation workflows via an interactive development environment. The term “safely” may be used in this context in the sense that embodiments of the invention may enforce one or more scripting rules, and/or one or more run-time rules to mitigate or prevent data breach, and minimize the risk for data exfiltration.
- Additionally, embodiments of the invention may also determine and/or enforce one or more security rules. Such rules may, for example, dictate what actions need to be done in case a specific data breach has been identified. For example, embodiments may determine that specific users or services would be denied access to data, that specific roles and/or permissions need to be revoked, and the like.
- It may be appreciated that each type of cloud computing platform may require unique APIs and technologies, and incur different costs for accessing or processing resident data. As elaborated herein, embodiments of the invention may be configured to interact with a variety of data services and/or cloud computing platforms, to optimally execute computing code on the production data, via the interactive development environment. The term “optimally” may be used in this context in the sense that embodiments of the invention may enforce one or more utilization rules to provide optimal execution of the computing code in relation to predefined metrics, such as cost metrics, throughput metrics, speed metrics and the like.
- Embodiments of the invention may include a proxy server for providing an interactive development environment. The proxy server may be configured to operatively connect to one or more client modules, each associated with a respective user development platform; receive, from at least one client module of the one or more client modules, one or more interactive computing documents such as Jupyter notebooks. Each such interactive computing document may include one or more scripting code elements also referred to herein as cells or code lines.
- According to some embodiments, the proxy server may be configured to scan the one or more scripting code elements according to a set of predetermined scripting rules; encapsulate the one or more interactive computing documents in one or more data containers, based on said scan; and transmit the one or more data containers to at least one cloud-based platform, to execute at least one scripting code element of the one or more interactive computing documents.
- According to some embodiments, the proxy server may receive, from the at least one client module, a workflow data element, defining a scheme for executing the one or more interactive computing documents. The proxy server may encapsulate the workflow data element with the one or more interactive computing documents in the one or more data containers; and transmit the one or more data containers to at least one cloud-based platform, to execute the at least one scripting code element of one or more interactive computing documents according to the execution scheme, as defined by the workflow data element.
- According to some embodiments, the scheme for executing the one or more interactive computing documents may include at least one of: (a) scheduling of execution of one or more interactive computing documents; (b) an order of execution of one or more interactive computing documents; (c) a condition for executing one or more interactive computing documents; and (d) concurrence of execution of two or more interactive computing documents.
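- An execution scheme that combines scheduling, order of execution and concurrence may be sketched as follows, using the `graphlib.TopologicalSorter` class of the Python standard library (Python 3.9 or later). The layout of the workflow dictionary, its key names, the notebook names and the cron-style schedule string are hypothetical and serve for illustration only.

```python
from graphlib import TopologicalSorter


def execution_batches(dependencies):
    """Group notebooks into batches; members of a batch may run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical workflow data element: each notebook maps to its prerequisites.
workflow = {
    "schedule": "0 2 * * *",  # assumed cron-style scheduling field
    "dependencies": {
        "ingest": set(),
        "clean": {"ingest"},
        "features": {"ingest"},
        "train": {"clean", "features"},
    },
}
batches = execution_batches(workflow["dependencies"])
```

- In this example, "clean" and "features" fall in the same batch and may therefore be executed concurrently, whereas "train" is executed only after both have completed.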
- Additionally, or alternatively, the one or more interactive computing documents may include document metadata. The proxy server may be configured to scan the document metadata according to a set of predetermined metadata rules; and encapsulate the workflow data element and the one or more interactive computing documents in one or more data containers, based on said scan of document metadata (e.g., when the document metadata was approved by the scan).
- Additionally, or alternatively, the proxy server may receive, from the at least one client module, at least one execution parameter value, associated with a specific interactive computing document. The proxy server may inject the received execution parameter value as metadata into the relevant interactive computing document; and further encapsulate the document metadata in the one or more data containers. The at least one cloud-based platform may, in turn, be configured to execute the at least one scripting code element of the one or more interactive computing documents according to document metadata.
- According to some embodiments, the proxy server may be configured to accumulate a subset of scripting code elements, each representing one or more data actions. The term “data actions” may refer herein to any action that may be performed on data, including for example: accessing data (e.g., reading data, writing data, deleting data, etc.) that is stored on a computing device such as a cloud-based platform, importing software modules, running, or executing specific commands or sets of commands on the computing device, and the like. The proxy server may classify the subset of scripting code elements, in relation to at least one client module, to one or more categories or classifications (e.g., legitimate action, illegitimate action, and the like). For example, the proxy server may classify the subset of scripting code elements in relation to a specific client module that initiated the data action. The proxy server may subsequently determine at least one scripting rule based on said classifications.
- Additionally, or alternatively, the proxy server may include, or be associated with a feature extraction module, configured to extract one or more data action features from at least one specific scripting code element of the subset of scripting code elements. The proxy server may also include, or be associated with an ML based model (e.g., a classification model), configured to classify the specific scripting code element according to the one or more data action classifications, based on the one or more data action features.
- Additionally, or alternatively, the proxy server may be configured to obtain, from the ML based model, a confidence score, representing pertinence of one or more scripting code elements to the one or more data action classifications; and determine at least one scripting rule based on the confidence score. The proxy server may subsequently enforce the at least one scripting rule on one or more scripting code elements, as elaborated herein.
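- The determination of a scripting rule from confidence scores may be sketched as follows. The ML based model itself is not shown; the sketch assumes that it has already produced, for a group of similar scripting code elements, confidence scores of pertinence to an "illegitimate action" classification. The thresholds, the use of a mean, and the rule layout are all assumptions made for this illustration.

```python
from statistics import mean


def derive_scripting_rule(pattern, illegitimate_scores,
                          block_at=0.9, alert_at=0.6):
    """Map averaged confidence scores (non-empty list) to a rule for a pattern."""
    confidence = mean(illegitimate_scores)
    if confidence >= block_at:
        action = "block"
    elif confidence >= alert_at:
        action = "alert"
    else:
        action = "allow"
    return {"pattern": pattern, "action": action, "confidence": confidence}


# Hypothetical scores for cells that write a dataframe to external storage.
rule = derive_scripting_rule("to_csv('s3://", [0.95, 0.97])
```

- A rule derived in this manner may then be enforced on subsequently received scripting code elements that match the same pattern.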
- According to some embodiments, the proxy server may be configured to encapsulate the workflow data element and the one or more interactive computing documents by: encapsulating the workflow data element in a first data container; spawning one or more second data containers, each encapsulating one or more interactive computing documents of the one or more interactive computing documents based on the workflow data element; and transmitting the one or more second data containers to the at least one cloud-based platform according to the execution scheme, as defined by the workflow data element.
- According to some embodiments, the proxy server may include, or be associated with a monitoring module, adapted to monitor execution of the scripting code elements (of the one or more interactive computing documents) by the at least one cloud-based platform. The proxy server may also include, or be associated with a rule engine module, which may be adapted to identify a problem in execution according to a set of predefined runtime rules (e.g., upon triggering of a runtime rule) based on said monitoring, as elaborated herein.
- For example, the rule engine module may be configured to communicate a notification message to a relevant client module (e.g., a client module which initiated the relevant scripting code element). The notification message may include a representation of the identified execution problem, to be displayed on a user interface (UI) of the user development platform.
- In another example, the rule engine module may be configured to perform at least one mitigation action, according to the identified execution problem (e.g., according to the triggered runtime rule). The mitigation action may include, for example blocking a user of a specific client module from accessing data stored on the at least one cloud-based platform, changing an access permission of a role of a user of a specific client module to a specific cloud-based platform, deleting data footprints of the at least one executed scripting code element, disabling a specific user of a specific client module from executing scripting code elements in their user development platform, and the like.
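- The triggering of runtime rules by monitored execution events, and the selection of a corresponding mitigation action, may be sketched as follows. The rule names, the event fields, the thresholds and the mitigation labels are hypothetical; the sketch only returns the actions to be taken and does not perform them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RuntimeRule:
    name: str
    triggered: Callable[[Dict], bool]  # predicate over a monitoring event
    mitigation: str                    # label of the action to perform


RULES = [
    RuntimeRule("long_running_cell",
                lambda event: event.get("duration_s", 0) > 3600,
                "notify_client"),
    RuntimeRule("bulk_export",
                lambda event: event.get("rows_written", 0) > 1_000_000,
                "block_user_data_access"),
]


def evaluate(event: Dict, rules: List[RuntimeRule] = RULES) -> List[Tuple[str, str]]:
    """Return (rule name, mitigation) for every rule the event triggers."""
    return [(rule.name, rule.mitigation) for rule in rules if rule.triggered(event)]


event = {"client": "client-7", "duration_s": 120, "rows_written": 5_000_000}
actions = evaluate(event)
```

- Keeping the rules as data, separate from the evaluation loop, allows runtime rules to be added or changed without modifying the rule engine itself.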
- Additionally, or alternatively, the proxy server may monitor execution of the scripting code elements by performing run-time monitoring of logs generated by the at least one cloud-based platform; run-time monitoring of an order of execution of the scripting code elements; run-time monitoring of output of the executed scripting code elements; monitoring duration of execution of at least one scripting code element of the one or more interactive computing documents; performing a comparison between output statistics of a current execution of an interactive computing document and output statistics of a previous execution of the interactive computing document; identifying at least one difference between an output of a specific scripting code element at a first execution and an output of the specific scripting code element at a second execution; identifying at least one difference between a first metadata element, associated with a specific scripting code element at a first execution and a second metadata element, associated with the specific scripting code element at a second execution; and any combination thereof.
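- The comparison between output statistics of a current execution and those of a previous execution may be sketched as follows. The statistic names, the relative-change measure and the 10% tolerance are assumptions chosen for illustration.

```python
def drifted_statistics(previous, current, tolerance=0.1):
    """Return statistics whose relative change exceeds the tolerance."""
    drifted = {}
    for key in previous.keys() & current.keys():
        old, new = previous[key], current[key]
        baseline = abs(old) if old else 1.0  # avoid division by zero
        change = abs(new - old) / baseline
        if change > tolerance:
            drifted[key] = change
    return drifted


previous_run = {"row_count": 1000, "mean": 5.0}
current_run = {"row_count": 1500, "mean": 5.1}
drift = drifted_statistics(previous_run, current_run)
```

- Here the row count changed by 50% and is reported, while the mean changed by roughly 2% and is not; a reported difference may then be evaluated against the runtime rules.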
- Additionally, or alternatively, the monitoring module may be adapted to monitor execution of the scripting code elements of the one or more interactive computing documents by the at least one cloud-based platform; produce a log data element, representing said monitored execution of the scripting code elements; and communicate the log data element to a relevant client module (e.g., a client module which initiated, or sent the relevant interactive code element). The proxy server may send the log data element in real time or near real time, to facilitate an aspect of interactivity of execution of the interactive code element. The client module may, in turn, be configured to display the log data element on a UI of the user development platform.
- Embodiments of the invention may include a system for providing an interactive development environment. Embodiments of the system may include a proxy server module, adapted to interface at least one cloud-based platform; and one or more client modules operatively connected to the proxy server, where each client module may be associated with a respective user development platform.
- According to some embodiments, the at least one client module of the one or more client modules may be configured to receive, from the respective user development platform, one or more interactive computing documents, each representing one or more scripting code elements; and transmit the one or more interactive computing documents to the proxy server. The proxy server may be configured to scan the one or more scripting code elements according to a set of predetermined scripting rules; encapsulate the one or more interactive computing documents in one or more data containers, based on said scan; and transmit the one or more data containers to the at least one cloud-based platform, to execute at least one scripting code element of the one or more interactive computing documents.
- Additionally, or alternatively, the at least one client module may be configured to produce a workflow data element, defining a scheme for executing the one or more interactive computing documents; and transmit the workflow data element to the proxy server. The proxy server may be configured to transmit the one or more data containers and the workflow data element to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the execution scheme, as defined by the workflow data element.
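- By way of a non-limiting illustration, the scan and encapsulation steps may be sketched as follows. The two regular-expression rules, the `DataContainer` layout and the choice to store the scan findings inside the container are assumptions made for this sketch; the specification does not define a rule syntax or a container format.

```python
import json
import re
from dataclasses import asdict, dataclass, field

# Hypothetical scripting rules: rule name -> pattern searched for in a cell's source.
SCRIPTING_RULES = {
    "direct_s3_access": re.compile(r"\bboto3\b"),
    "shell_escape": re.compile(r"^\s*!", re.MULTILINE),
}


@dataclass
class DataContainer:
    """Illustrative container holding a notebook's cells and the scan findings."""
    document_name: str
    cells: list
    findings: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self))


def scan_cells(cells):
    """Return one finding per (cell, rule) pair whose pattern matches."""
    findings = []
    for index, source in enumerate(cells):
        for rule_name, pattern in SCRIPTING_RULES.items():
            if pattern.search(source):
                findings.append({"cell": index, "rule": rule_name})
    return findings


def encapsulate(document_name, cells):
    """Scan the cells, then wrap them together with the scan result."""
    return DataContainer(document_name, list(cells), scan_cells(cells))


cells = [
    "import pandas as pd",
    "import boto3\ns3 = boto3.client('s3')",
    "!rm -rf /tmp/x",
]
container = encapsulate("analysis.ipynb", cells)
```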
- According to some embodiments, the proxy server may be configured to encapsulate the workflow data element with the one or more interactive computing documents in the one or more data containers.
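- By way of a non-limiting illustration, a workflow data element may be sketched as a mapping from each interactive computing document to the documents that must be executed before it. This dependency-based layout, the notebook names and the `execution_order` function are assumptions made for this sketch; the specification only states that the workflow data element defines a scheme for execution. The sketch relies on the `graphlib` module of the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow data element: each document lists its prerequisites.
workflow = {
    "name": "nightly_model_refresh",
    "steps": {
        "prepare.ipynb": [],
        "train.ipynb": ["prepare.ipynb"],
        "report.ipynb": ["train.ipynb"],
    },
}


def execution_order(workflow_element):
    """Return the documents in an order that respects every prerequisite."""
    sorter = TopologicalSorter(workflow_element["steps"])
    return list(sorter.static_order())


order = execution_order(workflow)
```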
- Additionally, or alternatively, the proxy server may be further configured to, during execution of the one or more interactive computing documents, scan at least one scripting code element according to the set of predetermined scripting rules, to verify at least one of (a) permission-based legitimacy of data retrieval by the at least one scripting code element, and (b) lack of malicious activity in the at least one scripting code element.
- Embodiments of the invention may include a method of providing an interactive development environment. Embodiments of the method may include providing a proxy server module, adapted to interface at least one cloud-based platform; providing one or more client modules, operatively connected to the proxy server, where each client module may be associated with a respective user development platform; receiving by at least one client module, from the respective user development platform, one or more interactive computing documents, each representing one or more scripting code elements; scanning the one or more scripting code elements by the proxy server, according to a set of predetermined scripting rules; encapsulating, by the proxy server, the one or more interactive computing documents in one or more data containers, based on said scan; and transmitting, by the proxy server, the one or more data containers to the at least one cloud-based platform, to execute at least one scripting code element of the one or more interactive computing documents.
- Embodiments of the invention may further include producing, by the at least one client module, a workflow data element, defining a scheme for executing the one or more interactive computing documents; encapsulating, by the proxy server, the workflow data element with the one or more interactive computing documents in the one or more data containers; and transmitting, by the proxy server, the one or more data containers to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the execution scheme, as defined by the workflow data element.
- The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
-
FIG. 1 is a schematic diagram, depicting a workflow of generating a machine learning model in a production environment, as known in the art. -
FIG. 2 is a block diagram, depicting a computing device which may be included in a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention; -
FIG. 3 is a block diagram, depicting application of a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention; -
FIG. 4 is a block diagram, depicting a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention; -
FIG. 5 is a flow diagram, depicting a method of providing an interactive development platform in a distributed computing environment, by at least one processor, according to some embodiments of the invention. - It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
- One skilled in the art will realize the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The foregoing embodiments are therefore to be considered in all respects illustrative rather than limiting of the invention described herein. Scope of the invention is thus indicated by the appended claims, rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
- In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention. Some features or elements described with respect to one embodiment may be combined with features or elements described with respect to other embodiments. For the sake of clarity, discussion of same or similar features or elements may not be repeated.
- Although embodiments of the invention are not limited in this regard, discussions utilizing terms such as, for example, “processing,” “computing,” “calculating,” “determining,” “establishing”, “analyzing”, “checking”, or the like, may refer to operation(s) and/or process(es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that may store instructions to perform operations and/or processes.
- Although embodiments of the invention are not limited in this regard, the terms “plurality” and “a plurality” as used herein may include, for example, “multiple” or “two or more”. The terms “plurality” or “a plurality” may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like. The term “set” when used herein may include one or more items.
- Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.
- Reference is now made to
FIG. 2 , which is a block diagram depicting a computing device which may be included in a system for providing an interactive development platform in a distributed computing environment, according to some embodiments of the invention. -
Computing device 1 may include a processor or controller 2 that may be, for example, a central processing unit (CPU) processor, a chip or any suitable computing or computational device, an operating system 3, a memory 4, executable code 5, a storage system 6, input devices 7 and output devices 8. Processor 2 (or one or more controllers or processors, possibly across multiple units or devices) may be configured to carry out methods described herein, and/or to execute or act as the various modules, units, etc. More than one computing device 1 may be included in, and one or more computing devices 1 may act as the components of, a system according to embodiments of the invention. -
Operating system 3 may be or may include any code segment (e.g., one similar to executable code 5 described herein) designed and/or configured to perform tasks involving coordination, scheduling, arbitration, supervising, controlling or otherwise managing operation of computing device 1, for example, scheduling execution of software programs or tasks or enabling software programs or other modules or units to communicate. Operating system 3 may be a commercial operating system. It will be noted that an operating system 3 may be an optional component, e.g., in some embodiments, a system may include a computing device that does not require or include an operating system 3. -
Memory 4 may be or may include, for example, a Random-Access Memory (RAM), a read only memory (ROM), a Dynamic RAM (DRAM), a Synchronous DRAM (SD-RAM), a double data rate (DDR) memory chip, a Flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units. Memory 4 may be or may include a plurality of possibly different memory units. Memory 4 may be a computer or processor non-transitory readable medium, or a computer non-transitory storage medium, e.g., a RAM. In one embodiment, a non-transitory storage medium such as memory 4, a hard disk drive, another storage device, etc. may store instructions or code which when executed by a processor may cause the processor to carry out methods as described herein. -
Executable code 5 may be any executable code, e.g., an application, a program, a process, task or script. Executable code 5 may be executed by processor or controller 2 possibly under control of operating system 3. For example, executable code 5 may be an application that may provide an interactive development environment, as further described herein. Although, for the sake of clarity, a single item of executable code 5 is shown in FIG. 2, a system according to some embodiments of the invention may include a plurality of executable code segments similar to executable code 5 that may be loaded into memory 4 and cause processor 2 to carry out methods described herein. -
Storage system 6 may be or may include, for example, a flash memory as known in the art, a memory that is internal to, or embedded in, a micro controller or chip as known in the art, a hard disk drive, a CD-Recordable (CD-R) drive, a Blu-ray disk (BD), a universal serial bus (USB) device or other suitable removable and/or fixed storage unit. Data pertaining to an interactive development environment may be stored in storage system 6 and may be loaded from storage system 6 into memory 4 where it may be processed by processor or controller 2. In some embodiments, some of the components shown in FIG. 2 may be omitted. For example, memory 4 may be a non-volatile memory having the storage capacity of storage system 6. Accordingly, although shown as a separate component, storage system 6 may be embedded or included in memory 4. -
Input devices 7 may be or may include any suitable input devices, components or systems, e.g., a detachable keyboard or keypad, a mouse and the like. Output devices 8 may include one or more (possibly detachable) displays or monitors, speakers and/or any other suitable output devices. Any applicable input/output (I/O) devices may be connected to computing device 1, as shown by blocks 7 and 8. It will be recognized that any suitable number of input devices 7 and output devices 8 may be operatively connected to computing device 1, as shown by blocks 7 and 8. - A system according to some embodiments of the invention may include components such as, but not limited to, a plurality of central processing units (CPU) or any other suitable multi-purpose or specific processors or controllers (e.g., similar to element 2), a plurality of input units, a plurality of output units, a plurality of memory units, and a plurality of storage units.
- A neural network (NN) or an artificial neural network (ANN), e.g., a neural network implementing a machine learning (ML) or artificial intelligence (AI) function, may refer to an information processing paradigm that may include nodes, referred to as neurons, organized into layers, with links between the neurons. The links may transfer signals between neurons and may be associated with weights. A NN may be configured or trained for a specific task, e.g., pattern recognition or classification. Training a NN for the specific task may involve adjusting these weights based on examples. Each neuron of an intermediate or last layer may receive an input signal, e.g., a weighted sum of output signals from other neurons, and may process the input signal using a linear or nonlinear function (e.g., an activation function). The results of the input and intermediate layers may be transferred to other neurons and the results of the output layer may be provided as the output of the NN. Typically, the neurons and links within a NN are represented by mathematical constructs, such as activation functions and matrices of data elements and weights. A processor, e.g., CPUs or graphics processing units (GPUs), or a dedicated hardware device may perform the relevant calculations.
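- By way of a non-limiting illustration, the weighted-sum and activation computation described above may be sketched for a small network with one hidden layer. The weights, the sigmoid activation and the function names are arbitrary choices made for this sketch, and training (adjustment of the weights based on examples) is not shown.

```python
import math


def sigmoid(x):
    """A common nonlinear activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    """A linear activation, used here for the output layer."""
    return x


def layer_forward(inputs, weights, activation):
    """Each neuron applies the activation to a weighted sum of its inputs."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)))
        for neuron_weights in weights
    ]


inputs = [1.0, 2.0]
hidden_weights = [[0.5, -0.25], [0.0, 0.0]]  # two hidden neurons
output_weights = [[1.0, 1.0]]                # one output neuron

hidden = layer_forward(inputs, hidden_weights, sigmoid)
output = layer_forward(hidden, output_weights, identity)
```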
- Reference is now made to
FIG. 3, which depicts an example of an application of a system 10 for providing an interactive development platform in a distributed computing environment 50, according to some embodiments of the invention. - As elaborated herein,
system 10 may be, or may include a client-server platform, configured to manage access and processing of production data, via an interactive development environment, in an optimized and secure manner. - According to some embodiments of the invention,
system 10 may be implemented as a software module, a hardware module, or any combination thereof. For example, system 10 may be or may include a computing device such as element 1 of FIG. 2, and may be adapted to execute one or more modules of executable code (e.g., element 5 of FIG. 2) to manage access and processing of production data, as further described herein. - As shown in
FIG. 3, arrows may represent flow of one or more data elements to and from system 10 and/or among modules or elements of system 10. Some arrows have been omitted in FIG. 3 for the purpose of clarity. - According to some embodiments,
system 10 may include one or more client modules 200 that may be integrated into, or may collaborate with, a development environment 30 such as an Amazon SageMaker notebook. Additionally, system 10 may include at least one proxy server module 100 that may be integrated into, or may collaborate with, a plurality of cloud computing platforms 510 such as Amazon EMR, Google Cloud and Microsoft Azure, and distributed processing systems 520 such as Apache Spark. - According to some embodiments, the at least one
proxy server module 100 and the one or more client modules 200 may be implemented on the same computing device (e.g., computing device 1 of FIG. 2). Alternatively, the at least one proxy server module 100 and the one or more client modules 200 may be implemented on separate computing devices 1, and may be communicatively connected via a computer communication network, such as the Internet. - As known in the art,
cloud computing platforms 510 and/or distributed processing systems 520 may facilitate access to a large variety of cloud-based services 530, including for example cloud storage services and resources 531 such as Amazon S3, cloud analytics databases 532 such as machine-learning (ML) or Business Intelligence (BI) analytics engines (e.g., Amazon Redshift), and data integration services 533 such as AWS Glue. - As elaborated herein,
system 10 may provide a centralized hub for following, detecting, and/or preventing data access to various cloud platforms and data providers. System 10 may identify violations of predefined access rules or privileges, and produce recommendations for implementation of actions upon identification of such conditions, to prevent compromise of production data. - According to some embodiments,
system 10 may collaborate with, or be integrated into, interactive development environments such as JupyterLab, Zeppelin, and the like. The term “interactive” may be used in this context to indicate that development environment 30 may enable a user (e.g., a developer) to interactively or iteratively analyze data and see the results of their analysis (e.g., in the form of datasets, graphic displays, and the like), during, or as part of the development process. In a similar manner, the term “interactive computing documents” may be used herein to refer to data elements such as Jupyter notebooks, that may be employed by system 10 to facilitate this quality of interactivity during, or following the process of development, as elaborated herein. - According to some embodiments,
system 10 may include an optimization component 10A, adapted to implement a data access optimization process. The system optimization component may gather historical statistic data from previous executions of scripts in an interactive development environment 30 such as Jupyter Notebook. Such historical statistic data may include, for example metadata representing processing time of scripting code elements (SCEs) such as Notebook cells, an amount (e.g., in data bytes) of data that had been accessed (e.g., read access, write access) by the SCEs, a size of datasets resulting from execution of the SCEs, one or more types or properties of execution solution (e.g., computing devices or platforms used for processing the SCEs), and the like. - The optimization component may subsequently, and automatically, choose a configuration that provides optimal execution of target SCEs in relation to predefined metrics, such as cost metrics, throughput metrics, speed metrics and the like.
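- By way of a non-limiting illustration, choosing a configuration from historical statistic data in relation to a predefined metric may be sketched as follows. The `RunRecord` fields, the configuration names and the rule of picking the lowest historical average are assumptions made for this sketch; the specification does not prescribe a selection algorithm.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """Illustrative historical record of one execution of a cell."""
    configuration: str
    duration_s: float
    cost_usd: float


def choose_configuration(history, metric):
    """Pick the configuration with the lowest historical average of the metric."""
    values_by_config = {}
    for record in history:
        values_by_config.setdefault(record.configuration, []).append(
            getattr(record, metric)
        )
    return min(values_by_config, key=lambda name: mean(values_by_config[name]))


history = [
    RunRecord("small-cluster", duration_s=120.0, cost_usd=0.4),
    RunRecord("small-cluster", duration_s=100.0, cost_usd=0.5),
    RunRecord("large-cluster", duration_s=30.0, cost_usd=1.2),
    RunRecord("large-cluster", duration_s=40.0, cost_usd=1.0),
]
fastest = choose_configuration(history, "duration_s")
cheapest = choose_configuration(history, "cost_usd")
```

In this sketch a speed metric and a cost metric lead to different choices, which is why the metric is passed in rather than fixed.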
- For example, the optimization component may automatically select, and/or configure a
cloud computing platform 510 such as Amazon EMR, Google Cloud or Microsoft Azure, to provide optimal execution of one or more target SCEs (e.g., cells) in an interactive development environment 30 such as Jupyter Notebook. - In another example, the optimization component may automatically select, and/or configure a cloud solution such as a cloud-based service 530 (e.g.,
cloud storage services 531, cloud analytics databases 532 and/or data integration services 533), to provide an optimal execution of the one or more target SCEs. - In another example, the optimization component may automatically select or configure a number, a type, and/or one or more properties of computing devices of distributed
processing systems 520. Such properties may include, for example a size of memory associated with the distributed processing systems 520, a number and/or a type of processing cores associated with the distributed processing systems 520, and the like. - According to some embodiments,
proxy server 100 may be operatively connected to one or more client modules 200, each associated with, or included in a respective user development platform 30. - For example,
development platform 30 may be, or may include a platform such as Amazon's SageMaker, where a user or developer may draft or write one or more interactive computing documents (ICDs) 30A, commonly referred to as notebook documents (or “notebooks”, for short). The terms “interactive computing documents”, “ICD” and “notebooks” may be used herein interchangeably. Additionally, each such notebook document 30A may include one or more scripting code elements (SCEs) 30B, commonly referred to as “cells”. The terms “scripting code elements”, “SCE”, and “cells” may be used herein interchangeably. - According to some embodiments, each
client module 200 may be associated with a respective development platform 30. For example, client module 200 may be implemented as a software module or an add-on that may interface, or communicate with a respective development platform 30 (e.g., the SageMaker platform). Client module 200 may also operatively connect to, or communicate with, one or more proxy servers 100, to provide capabilities of system 10 as elaborated herein. - In some embodiments,
development platform 30 and client module 200 may be installed, or included on the same computing device, such as computing device 1 of FIG. 2. Alternatively, development platform 30 and client module 200 may be implemented on separate computing devices 1, and may be communicatively connected via a computer network such as the Internet. - According to some embodiments,
client 200 may receive, from development platform 30, one or more ICDs (e.g., notebooks) 30A, that may include one or more SCEs (e.g., cells) 30B. Client 200 may transmit the one or more received ICDs 30A to proxy server 100, for example as a data action, such as a data access request 200B. Proxy server 100 may, in turn, receive the one or more data actions (e.g., data access requests) 200B (e.g., ICD 30A) from at least one client module 200, and may communicate with distributed computing environment 50 to execute, or process the one or more data actions (e.g., data access requests 200B) (e.g., ICDs 30A) on cloud computing platforms 510 of the distributed computing environment 50. - As elaborated herein,
proxy server 100 and client 200 may jointly implement an optimization aspect or component 10A of system 10, to improve aspects of efficiency in execution of one or more cells 30B. - For example,
proxy server 100 may implement a caching system, to improve throughput or latency of execution. For example, proxy server 100 may be configured to maintain (e.g., in storage 6 of FIG. 2) historical results of executed cell 30B processes. Proxy server 100 may receive from one or more clients 200 one or more new cells 30B, to be executed or processed in distributed computing environment 50. Proxy server 100 may automatically select whether to return previously retrieved results from cache (e.g., from storage 6), or communicate the newly received cells to distributed computing environment 50, in order to re-run the processing. - In another example,
client 200 may be configured to scan a source code in one or more notebooks 30A or cells 30B received from interactive development environment 30 according to a predetermined set of rules, to find failures in the source code. Client 200 may transmit a notification message 200A to server 100, that may include information regarding outcome of the scan. Such information may include, for example, whether any failures in the source code were found and/or whether the server is required to perform further analysis to decide whether any optimization actions are required. For example, failure of execution of source code in a cell 30B may be caused by a processing timeout or a lack of memory in a cloud computing platform 510. In such a condition, proxy server 100 may determine that an upgrade (e.g., addition of memory space and/or additional allocation of processors) of cloud computing platforms 510 is required. - In another example,
client 200 may be configured to scan a source code of cells 30B to identify indications in the source code that require usage of specific, or special computing devices. For example, client 200 may identify code snippets that make use of “GPU” or “CUDA” indications, or require usage of machine-learning libraries or APIs such as TensorFlow or Keras. Client 200 may transmit an indication 200A of such findings to proxy server 100 for further analysis, to decide whether a GPU or TPU computing device should be used for processing or retrieving source code of the scanned cell 30B. - Additionally, or alternatively, and as elaborated herein,
proxy server 100 and client 200 may jointly implement a security aspect or component 10B of system 10, to improve aspects of data security prior to, and during execution of one or more cells 30B. - For example,
proxy server 100 and/or client 200 may be configured to perform a preemptive scan of one or more SCEs of the interactive development environment 30 (e.g., source code included in notebooks 30A and/or cells 30B) according to one or more (e.g., a set of) predetermined scripting rules 100D. The term “preemptive” may be used herein to indicate a scan that may be performed prior to communicating the cells' 30B source code to distributed computing environment 50 for processing. - For example, during a preemptive scan,
proxy server 100 may be configured to scan source code of one or more cells 30B based on the predetermined scripting rules 100D, to find utilization of elements such as commands, actions, API calls and import objects (e.g., libraries such as Pandas, AWS Data Wrangler or direct Spark commands) that may be used to retrieve data from distributed computing environment 50 in a manner that circumvents, or is not performed via system 10. Proxy server 100 may log such occurrences, and/or implement preemptive measures or actions to mitigate or avoid risk of data exfiltration. For example, proxy server 100 may be configured to block access of a specific entity (e.g., a specific client 200, a specific user, a specific account or role, a specific cell or notebook, and the like) to relevant elements (e.g., cloud services 530, cloud computing platforms or clusters 510, datasets and the like) of distributed computing environment 50. - Additionally, or alternatively,
proxy server 100 and/or client 200 may be configured to perform a runtime scan of one or more SCEs of the interactive development environment 30. The term “runtime” may be used herein to indicate a scan that may be performed in conjunction with running, or execution of source code in one or more cells 30B or notebooks 30A. - For example, during a runtime scan,
proxy server 100 may track manipulation and/or diversion of data (or a data-set, as commonly referred to in the art) retrieved from distributed computing environment 50. - In some embodiments,
proxy server 100 may hold a copy 100A of one or more relevant, retrieved elements of the data-set. Additionally, or alternatively, proxy server 100 may maintain a runtime log 100B of actions that have been performed on data elements in the data-set. Runtime log 100B may include, for example, code lines of executed data manipulations, timestamps of executed data manipulations, results or outputs of executed data manipulations, relevant object names, relevant file names, relevant database tables and the like. - According to some embodiments, during (e.g., as part of) the runtime scan,
proxy server 100 may be configured to continuously (e.g., repeatedly, over time) perform runtime analysis of the copy 100A of one or more elements of the data-set (referred to herein as “copy 100A”) and/or runtime log 100B. In this runtime analysis, proxy server 100 may be configured to find relevant data elements and corresponding relevant commands, that may be regarded as suspicious, according to a predetermined set of runtime rules 100C. - For example, a
runtime rule 100C may be implemented as a data structure (e.g., a table), that may associate at least one element in distributed computing environment 50 (e.g., a portion of a dataset) with at least one rule or restriction to be imposed upon that element. For example, a runtime rule 100C may dictate that at least one data element of a dataset stored by a cloud storage service 531 of distributed computing environment 50 may be prohibited from being stored elsewhere (e.g., locally), on a storage device (e.g., element 6 of FIG. 2) associated with a computing device 1 where development environment 30 is deployed. It may be appreciated that additional runtime rules 100C may also be generated, or used, depending on specific characterization of proxy server 100. -
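By way of a non-limiting illustration, a runtime rule held as a table entry, and its evaluation against entries of a runtime log, may be sketched as follows. The `RuntimeRule` and `LoggedAction` fields, the action name "store_locally", the storage paths and the prefix-matching convention are assumptions made for this sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeRule:
    """Associates an element (here, a dataset path prefix) with a prohibited action."""
    element_prefix: str
    prohibited_action: str


@dataclass(frozen=True)
class LoggedAction:
    """One illustrative entry of a runtime log."""
    cell_id: str
    action: str
    element: str


def find_execution_problems(rules, runtime_log):
    """Return (log entry, rule) pairs for every log entry that triggers a rule."""
    problems = []
    for entry in runtime_log:
        for rule in rules:
            if (entry.action == rule.prohibited_action
                    and entry.element.startswith(rule.element_prefix)):
                problems.append((entry, rule))
    return problems


rules = [RuntimeRule("s3://prod-data/customers/", "store_locally")]
runtime_log = [
    LoggedAction("cell-1", "read", "s3://prod-data/customers/2022.parquet"),
    LoggedAction("cell-2", "store_locally", "s3://prod-data/customers/2022.parquet"),
    LoggedAction("cell-3", "store_locally", "s3://public-data/weather.csv"),
]
problems = find_execution_problems(rules, runtime_log)
```

In this sketch only the second log entry triggers the rule: the first entry is a permitted read, and the third stores data that the rule does not cover.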
Proxy server 100 may be configured to identify an execution problem as a condition in which runtime rule 100C has been triggered. Pertaining to the example above, an identified execution problem may be a condition in which an SCE 30B has attempted to store the relevant data element in a prohibited location. - Additionally,
proxy server 100 and/or client 200 may be configured to enforce the runtime rules, to mitigate risks to data security, based on the runtime analysis (e.g., based on identification of an execution problem, or trigger of a runtime rule 100C). - Pertaining to the same example, in case that an illegitimate storage command was received,
proxy server 100 may be configured to communicate a mitigation action 100G as an alert message 100F (e.g., a revocation alert message 100F) to client 200, which may be configured to perform a mitigation action 100G (e.g., revocation action) in the development environment, based on message 100F, so as to delete the relevant data copies and protect against data exfiltration. - Additionally, or alternatively, and as elaborated herein,
system 10 may serve as a virtualization layer, enabling users of development environment 30 to configure, and interface distributed computing environment 50 of various types and configurations. - For example,
system 10 may include a development environment integrated library that may facilitate a unified API 100E to access a multitude of types of cloud-based services and/or data elements via proxy server 100. Unified API 100E may be adapted to enable a streamlined user definition of computing platforms and clusters, management of cloud-based datasets, and execution of big-data code scripts in a manner that is oblivious to the specific underlying cloud-based environment technology. - Additionally, or alternatively,
system 10 may utilize unified API 100E to enable management of user data access and roles, so as to allow access to specific data, and/or run specific data manipulation scripts, while monitoring one or more (e.g., every) incident of data access. Based on this monitoring, system 10 may determine maliciousness of incidents of data access, and optionally block access in case malicious or unauthorized access is detected, as elaborated herein. Additionally, system 10 may be configured to revoke access and/or automatically delete any data leftovers, such as development environment variables that may be stored or cached in one or more computing devices 1 associated with development environment 30. - Additionally, or alternatively,
system 10 may be configured to provide real-time alerts to any predetermined endpoint, such as a predefined email, a text message (e.g., a short messaging service (SMS) message) associated with a specific person, a chat-based service, associated with a specific account, and the like. - For example,
system 10 may identify an unauthorized access to a cloud-based element (e.g., a data element stored by acloud storage service 531 of distributed computing environment 50). Such identification may be based, for example, on (b) automatic detection of an anomalous incident of data access; (b) a rule-based monitoring of data access; (c) revocation (manual or automatic) of a policy, a role, and/or a dataset; (d) expiration of a timeframe for accessing the relevant, specific dataset, and the like. In such embodiments,server 100 may generate analert notification 100F, and may sendnotification 100F e.g., as an email message to one or more relevant accounts. - Additionally, or alternatively,
notification 100F may include a command, such as a command to perform a mitigation action 100G (e.g., delete a specific data element). Notification 100F may be sent to one or more (e.g., all) clients 200 that are associated with relevant development environments 30. These relevant development environments 30 may be development environments 30 that hold the relevant data, or that are associated with other memory devices, data storage devices, caches, files etc., that were used to save the relevant data. Clients 200 may subsequently proceed to actively delete or purge the relevant data elements as dictated by notification 100F. - Reference is now made to
FIG. 4, which is a block diagram, depicting another example of a system 10 for providing an interactive development platform in a distributed computing environment 50, according to some embodiments of the invention. It may be appreciated that system 10 of FIG. 4 may be the same as system 10 of FIG. 3. Some elements of FIG. 3 have been omitted in FIG. 4 for the purpose of clarity. - As shown in
FIG. 4, at least one client 200 may include a code run module 210, adapted to receive, from a respective development environment 30, one or more ICDs (e.g., notebooks) 30A, each including one or more SCEs (e.g., cells) 30B. - 
Code run module 210 may be configured to process the received SCEs 30B to extract one or more entities, intended for obtaining a specific data set. Such entities may include, for example, commands, SQL syntax queries, source code elements (e.g., Spark or map-reduce code, notebook .ipynb files, etc.) and the like. Code run module 210 may be configured to utilize unified API 100E to convert the one or more entities to a relevant format, and send the converted ICDs (e.g., notebooks) 30A and/or SCEs (e.g., cells) 30B to proxy server 100. Proxy server 100 may in turn communicate the converted notebooks 30A and/or cells 30B to distributed computing environment 50, so as to perform data actions (e.g., data access actions, such as writing data into, or deleting data from, a specific, allowed location on a specific cloud-based service 530). - Additionally, or alternatively,
code run module 210 may be configured to produce at least one ICD (e.g., a notebook) 30A that may be referred to herein as a workflow data element 210A (or “workflow” 210A for short). Workflow data element 210A may be, or may include a data structure (e.g., a table) that may associate between execution properties of one or more notebooks 30A and/or cells 30B. In other words, workflow 210A may define a scheme for executing the one or more ICDs 30A and/or SCEs 30B. - For example,
workflow data element 210A may define a scheme for executing one or more ICDs 30A in a sense of scheduling of execution of one or more notebooks 30A or cells 30B. Such scheduling may include, for example, setting a trigger (e.g., a timestamp, an interrupt, etc.) for executing the one or more notebooks 30A or cells 30B by a cloud computing platform 510 of distributed computing environment 50. - In another example,
workflow data element 210A may define a scheme for executing one or more ICDs 30A in a sense of setting an order of execution of one or more notebooks 30A or cells 30B. Such ordering may include, for example, setting a first notebook 30A to be executed by a cloud computing platform 510, and setting a second notebook 30A to be executed by cloud computing platform 510 only after execution of the first notebook 30A is completed. - In another example,
workflow data element 210A may define a scheme for executing one or more ICDs 30A in a sense of setting a logical condition for executing the one or more notebooks 30A or cells 30B (e.g., execute a first notebook 30A only if a first condition is met, and execute a second notebook 30A only if a second condition is met). - In another example,
workflow data element 210A may define a scheme for executing one or more ICDs 30A in a sense of setting up a condition of concurrence of execution of two or more notebooks 30A or cells 30B. For example, setting up a plurality of cells to be executed in parallel (e.g., substantially at the same time), e.g., by using a plurality of processors of cloud computing platform 510. - According to some embodiments,
workflow data element 210A may be received, e.g., from a user or another computing device (e.g., via input 7 of FIG. 2). Additionally, or alternatively, code run module 210 may receive (e.g., as a script, via a user interface, and the like) one or more parameters defining characteristics or properties of a workflow data element 210A (e.g., ordering of notebook execution, concurrence of execution, condition of concurrence of execution, timing of notebook execution, and the like), and may produce workflow data element 210A so as to define a scheme for executing one or more ICDs 30A, based on the one or more received parameters. - As elaborated herein,
code run module 210 may be configured to transmit workflow data element 210A to proxy server 100, which may in turn communicate workflow 210A to distributed computing environment 50, in order to execute the one or more notebooks 30A or cells 30B according to the execution scheme defined by workflow 210A. - According to some embodiments, at least one
client 200 may include a dataflow monitoring module 220, configured to extract relevant entities of the received notebooks 30A or cells 30B. Such extracted entities may include, for example, variables, file-names, local-database-actions, names, cache-engines and/or cookies used in the development environment that include, or are a copy or an output of, a dataset retrieved from distributed computing environment 50. Additionally, the extracted entities may include parameters that are included in cells 30B, that may be utilized to determine legitimacy of data actions (e.g., data access). - According to some embodiments,
dataflow monitoring module 220 may be configured to log every code snippet in which the dataset name, or a copy that was identified as a copy of the dataset, is mentioned, and send the logs to proxy server 100, while keeping a local encrypted log that may be used by a data revocation module 230, as elaborated herein. - According to some embodiments, at least one
client 200 may include a data revocation module 230, adapted to receive an alert or a message (e.g., element 100F of FIG. 3) of revocation from proxy server 100, and get the relevant objects (e.g., variables, file stores, caches) from dataflow monitoring module 220. Data revocation module 230 may subsequently delete the relevant data according to message 100F. - As shown in
FIG. 4, at least one client 200 may include a notification module 240. According to some embodiments, data revocation module 230 may be configured to communicate with notification module 240, to notify the results of a revocation to relevant users. Notification module 240 may, in turn, be configured to transmit a notification 240A to at least one user regarding occurrence of data revocation, and relevant entities (e.g., variables, files, caches) that have been affected. The notification may be transmitted, for example, as a text message to an email account or messaging account of one or more relevant users. Additionally, or alternatively, notification module 240 may transmit notification 240A to development environment 30, to be presented to a user. - As shown in
FIG. 4, at least one client 200 may include a code tracking module 250, configured to scan source code that is included in one or more ICDs 30A (e.g., notebooks 30A) or SCEs 30B (e.g., cells 30B) of development environment 30. Code tracking module 250 may scan the source code continuously (e.g., periodically, or repeatedly over time). Alternatively, code tracking module 250 may scan the source code on demand, or after every instance of code execution. - According to some embodiments, during scan of the source code,
code tracking module 250 may look for specific code execution results, failures or snippets that contain or match a list of terms. This list of terms may, for example, be retrieved from proxy server 100 or from a database 80. - According to some embodiments,
code tracking module 250 may log each finding according to a relevant log category. The log categories may include, for example, specific instances of source-code execution, specific data-sets, specific data risks, and the like. - Additionally, or alternatively,
code tracking module 250 may scan the source code of development environment 30 (e.g., source code in ICDs 30A and/or SCEs 30B) to find instances of code that are adapted to attempt to access distributed computing environment 50 while circumventing proxy server 100. Such access attempts may utilize, for example, commands, action APIs and import-objects to retrieve data stored by cloud-based services 530. Code tracking module 250 may log findings of such instances, and may transmit these findings to proxy server 100 in order to log, notify and/or block relevant access, or to optimize a current or a subsequent execution after “learning” the relevant terms and code-snippets. - Additionally, or alternatively,
code tracking module 250 may collaborate with data revocation module 230 to delete or purge a data footprint of at least one SCE 30B. - For example,
code tracking module 250 may record a lineage of at least one data element or data frame of SCE 30B. Such lineage may include, or represent a history of locations and/or links to locations in which at least one version of the at least one data element of SCE 30B was stored or kept. - For example,
code tracking module 250 may record copying of a first data frame into a second data frame, renaming (e.g., changing a filename) or reformatting (e.g., changing a file type) of a first data frame into one or more second data frames, inclusion of a first data frame (e.g., a data structure such as a table) in one or more second data frames (e.g., in a comma-separated values (CSV) file), uploading or saving the first data frame (e.g., onto a cloud storage service), and the like. - 
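The lineage recording described above can be illustrated with a minimal sketch. It assumes that each copy, rename, reformat or upload is reported as a (source, target) pair; the class name `LineageTracker`, its methods and the sample names (including the `cloud://example-bucket/...` location) are hypothetical and are not taken from the specification.

```python
from collections import defaultdict, deque


class LineageTracker:
    """Hypothetical helper: records where versions of a data frame end up."""

    def __init__(self):
        # Maps a source name to the set of names directly derived from it.
        self._derived = defaultdict(set)

    def record(self, source, target):
        """Record that `target` was produced from `source` (copy, rename, save, upload)."""
        self._derived[source].add(target)

    def footprint(self, root):
        """Return `root` plus every name transitively derived from it."""
        seen = {root}
        queue = deque([root])
        while queue:
            current = queue.popleft()
            for child in self._derived.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


tracker = LineageTracker()
tracker.record("df_customers", "df_copy")                  # copy into a second data frame
tracker.record("df_copy", "customers.csv")                 # reformat / save as a CSV file
tracker.record("customers.csv", "cloud://example-bucket/customers.csv")  # upload (assumed location)
tracker.record("df_orders", "orders.csv")                  # unrelated lineage

to_purge = tracker.footprint("df_customers")
```

In this sketch, `to_purge` is the set of names a revocation step would need to delete or disable in order to remove the footprint of `df_customers`, while the unrelated `orders.csv` lineage is left untouched.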
Code tracking module 250 may subsequently be invoked by data revocation module 230 as part of a mitigation action, to delete or purge a footprint of the first data frame or data element. Pertaining to the same example, code tracking module 250 may delete the footprint of the first data frame or data element by deleting, or disabling access to, copies and/or versions of the first data frame and/or the one or more second data frames. - Additionally, or alternatively,
code tracking module 250 may be configured to perform run-time analysis of execution of one or more SCEs (e.g., cells) 30B of one or more ICDs 30A (e.g., notebooks). - For example, during processing or execution of SCEs (e.g., cells) 30B,
code tracking module 250 may continuously communicate with proxy server(s) 100 to monitor output logs produced by computing environment 50, and extract therefrom processing statistics such as processing time (e.g., duration of execution) of SCEs (e.g., cells) 30B. Code tracking module 250 may then analyze the processing statistics, for example by comparing the processing statistics (e.g., execution time) to historical processing statistics of previous executions, as elaborated herein. - As shown in
FIG. 4, at least one proxy server 100 may include a data request analysis module 110, adapted to analyze commands and features received from code run module 210 of a client 200. Additionally, data request analysis module 110 may be configured to scan files that may be sent by one or more clients 200 as part of command parameters. - According to some embodiments, data
request analysis module 110 may extract one or more features, such as dataset names and terms, from the received command, and compare the one or more extracted features to similar historical commands, results and/or processing periods, and/or to features associated with specific users, such as roles, policies, countries, Internet Protocol (IP) addresses and the like. Based on this comparison, data request analysis module 110 may check for abnormal behavior. - For example, data
request analysis module 110 may receive (e.g., from input 7 of FIG. 2) a set of predetermined scripting rules 100D and/or runtime rules 100C. Additionally, or alternatively, and as elaborated herein, data request analysis module 110 may automatically generate a set of scripting rules 100D and/or runtime rules 100C using a machine-learning (ML) model 113, based on data from a data enhancement module 150 and historical logs. - According to some embodiments, during preemptive scan of an
ICD 30A (e.g., a notebook), client 200 may upload ICD 30A to proxy server 100 to be scanned. Proxy server 100 may analyze the code elements 30B of the received ICD 30A in view of a repository or database 80 of suspicious or malicious actions (denoted in FIG. 4 as database 80). In some embodiments, repository 80 may include a static repository 80A, that may include or store one or more definitions of actions that are identified as suspicious or malicious (e.g., from third-party databases or online data sources (e.g., GitHub)). - Additionally, or alternatively,
database 80 may include a dynamic repository 80B, that may store, or include definitions of actions that are classified as suspicious or malicious by ML model 113, in a continuous effort to identify such actions or code lines in SCEs 30B. In some embodiments, ML model 113 may be configured to learn a pattern of history of actions of specific users and/or groups of users, as represented by code lines or SCEs 30B (e.g., cells) in notebooks 30A, to identify anomalous actions, and mitigate potential threats derived from such actions. - According to some embodiments,
ML model 113 may be configured to accumulate a subset of SCEs 30B, each representing one or more data actions. The one or more data actions may include, for example, actions of accessing data that is stored on the at least one cloud-based platform 50, actions of importing software modules, actions of running or executing specific commands or sets of commands, and the like. ML model 113 may, in turn, classify the subset of scripting code elements (e.g., in relation to at least one client module 200), to one or more data action classifications or categories. - For example, if a specific type of
SCE 30B, representing a data access action, is frequently (e.g., beyond a predefined threshold) performed by a specific user via a specific client 200, then ML model 113 may be configured to classify the specific type of SCE 30B as a legitimate action for that user and/or client 200 (e.g., the corresponding action classification may be “legitimate data access”). - In a complementary manner, if a specific type of
SCE 30B is not frequently (e.g., below a predefined threshold) performed by a specific user via a specific client 200, then ML model 113 may be configured to classify the specific type of SCE 30B as an illegitimate action for that user and/or client 200 (e.g., the corresponding action classification may be “illegitimate data access”). - It may be appreciated that other data action classifications are also possible, according to specific types of corresponding data actions.
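The frequency-based classification described above can be approximated by a simple counting sketch. This is a stand-in for illustration only, not the ML model 113 itself: the function name `classify_by_frequency`, the `(user, sce_type)` history format and the threshold value are assumptions, and treating a count exactly equal to the threshold as illegitimate is an arbitrary choice that the text does not specify.

```python
from collections import Counter

LEGITIMATE = "legitimate data access"
ILLEGITIMATE = "illegitimate data access"


def classify_by_frequency(history, user, sce_type, threshold):
    """Classify an SCE type for a user by how often that user performed it.

    `history` is an iterable of (user, sce_type) pairs (assumed format).
    Counts strictly above `threshold` are treated as legitimate.
    """
    counts = Counter(history)
    if counts[(user, sce_type)] > threshold:
        return LEGITIMATE
    return ILLEGITIMATE


# Hypothetical history: one user reads a dataset often and deletes from it once.
history = [("alice", "read_sales")] * 5 + [("alice", "delete_sales")]

frequent = classify_by_frequency(history, "alice", "read_sales", threshold=3)
rare = classify_by_frequency(history, "alice", "delete_sales", threshold=3)
```

Because the count is kept per (user, action type) pair, the same action may be classified differently for different users, which matches the per-user framing of the passage.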
- Based on this classification,
ML model 113 may determine at least one scripting rule 100D or metadata rule 100D′, that may associate the relevant SCE 30B or action with a specific mitigation action or message 100F. - Additionally, or alternatively, data
request analysis module 110 may include a feature extraction module 112, configured to extract one or more action features 112A from at least one specific SCE 30B. The one or more features 112A may include, for example, features of data access actions, representing patterns of accessing data on a distributed computing platform 50. Feature extraction module 112 may introduce the one or more extracted features 112A to ML model 113, which may classify the specific SCE 30B according to one or more action classifications (e.g., data access action classifications), based on the one or more features 112A (e.g., the data action features 112A). - For example, data action features 112A may include data access features such as a frequency of accessing data by one or more specific users, historical success, or failure of accessing data by the specific users, latency of data retrieval and the like. During a training period,
ML model 113 may receive data action features 112A (e.g., data access features 112A) pertaining to a plurality of SCEs 30B, and may be trained to predict or classify corresponding SCEs as including legitimate actions (e.g., legitimate data access requests) or illegitimate actions (e.g., illegitimate data access requests). In a subsequent inference stage, ML model 113 may receive action features 112A (e.g., data access features 112A) pertaining to a specific target SCE 30B, and classify the target SCE 30B as including legitimate or illegitimate actions (e.g., legitimate, or illegitimate data access requests), according to the one or more action features 112A (e.g., data access features 112A), based on the training. - In other words,
proxy server 100 may be configured to verify, during run-time or just preceding run-time, at least one of (a) permission-based legitimacy of data retrieval by the at least one scanned SCE 30B, and (b) lack of malicious activity in the at least one scanned SCE 30B. - Additionally, or alternatively,
ML model 113 may produce a confidence score, representing pertinence of one or more SCEs 30B to the one or more classifications. For example, ML model 113 may provide a confidence score representing the extent to which a specific SCE 30B pertains to a specific action classification (e.g., legitimate, or illegitimate data access requests). Data request analysis module 110 may subsequently determine at least one scripting rule 100D based on the confidence score. For example, if an action represented by SCE 30B is classified by ML model 113 as illegitimate with a high (e.g., beyond a predefined threshold) confidence score, then scripting rule 100D may associate the SCE 30B with a first (strong) mitigation action 100G, such as denial of execution and/or revoking of data. In a complementary manner, if an action represented by SCE 30B is classified by ML model 113 as illegitimate with a low (e.g., below a predefined threshold) confidence score, then scripting rule 100D may associate the SCE 30B with a second (weak) mitigation action 100G, such as sending a notification 100F to an administrator's email. - Additionally, or alternatively,
ML model 113 may cluster SCEs 30B (or command lines, included in SCEs 30B) based on previously received SCEs 30B, and/or SCEs 30B that are already included in database or repository 80. In some embodiments, ML model 113 may cluster SCEs 30B into one or more clusters (denoted 113′) to form a clustering model (also denoted 113′), and then classify one or more incoming target SCEs 30B as legitimate, or illegitimate, based on the clustering model 113′. The terms “clusters” and “cluster model” may be used interchangeably in this context. - According to some embodiments, data
request analysis module 110 may calculate a similarity score, or similarity metric value 117, for one or more code lines of the SCEs 30B to corresponding code lines in repository 80. For example, a first code line of a first SCE 30B such as “%run -t mynb.ipynb” may have high similarity (e.g., a high similarity metric value 117) to a second code line such as “%run” that may be stored in repository 80 as a legitimate code line. Data request analysis module 110 may apply a similarity threshold to associate these two code lines in the same cluster 113′ of the clustering model 113′, and may subsequently classify the code line “%run -t mynb.ipynb” as legitimate, based on this clustering 113′. - Additionally, or alternatively,
data request analysis module 110 may be configured to cluster SCEs 30B into clusters 113′ that include, or pertain to similar actions. Such actions may include, for example, importing of modules, “pandas” data frame actions, special notebook actions implemented in notebook kernels (commonly referred to as “magic” actions), data retrieval actions and the like. - For example,
data request analysis module 110 may calculate a frequency metric value, representing a frequency of appearance of terms or keywords in code lines of SCE 30B. Additionally, or alternatively, data request analysis module 110 may calculate a similarity metric value 117, representing similarity between two or more of the terms or keywords of SCE 30B. ML model 113 may subsequently cluster the terms or keywords based on the calculated frequency metric value and/or the similarity metric value 117 to produce a cluster model (denoted 113′), that may be or may include a plurality of clusters (also denoted 113′). Each cluster 113′ of cluster model 113′ may represent terms that are close in meaning and/or utilization. - Additionally, or alternatively, for one or more (e.g., each)
cluster 113′ of cluster model 113′, data request analysis module 110 may calculate or generate sub-clusters (denoted 113″), to indicate a finer granularity of similar actions, using a similarity algorithm. For example, a cluster 113′ of import actions may include a first sub-cluster 113″ that pertains to importing “pandas” elements, and a second sub-cluster 113″ that pertains to importing “JSON” elements. - Additionally, or alternatively, for one or more (e.g., each)
cluster 113′ and/or sub-cluster 113″, data request analysis module 110 may calculate a ranking score 116, representing legitimacy of code lines or SCEs 30B included in that cluster 113′ or sub-cluster 113″. - For example,
data request analysis module 110 may calculate the ranking score 116 as a function of a frequency (e.g., the number of times) that a relevant SCE 30B (which may also be referred to herein as a “code line”, “term”, or “cell”) appears in a repository of client 200 and/or in third-party, online repositories. - Additionally, or alternatively, the ranking score 116 may be calculated as a function of the time (e.g., days) that has elapsed since a previous appearance of an
SCE 30B of the same cluster 113′ in the scanned notebooks 30A. - For example, a ranking score 116 (denoted RankScore(i)) of a
specific SCE 30B (denoted ‘i’) may be calculated according to equation Eq. 1, below: - 
 - Where TotalTermFreq(i) may be the number of times that a specific term or sub-term (e.g., “import pandas”) was found (e.g., counted) in the customer's
historical notebooks 30A or code-scripts;
ActionTime(i) may be the time (e.g., in seconds) that the action took to perform (e.g., for code actions);
Max_{all clusters} [AllTerms(cluster)] may be the maximum number of times that a term or a sub-term was found (e.g., counted) for this specific notebook 30A in historical runs (e.g., all historically recorded executions);
Max_{0≤i≤R} [ActionTime(i)] may be the maximum amount of time (e.g., in seconds) that a term or sub-term took for this specific notebook 30A in historical runs (e.g., all historically recorded executions); and
log(dDays(i)+1) may be a logarithm of the number of days, plus one, that have elapsed since this specific notebook 30A (that had this term or sub-term ‘i’) ran, until the present time. - According to some embodiments, data
request analysis module 110 may produce one or more run-time rules 100C and/or one or more scripting rules 100D, based on the clustering 113′. Run-time rules 100C and/or scripting rules 100D may be stored in repository 80, and may be used for continuous preemptive, and run-time analysis of ICDs 30A and SCEs 30B. - For example, as elaborated herein,
ML model 113 may classify or define a cluster 113′ of actions as representing illegitimate actions (e.g., actions that include transferring of data to an untrusted location). Data request analysis module 110 may create a scripting rule 100D, and associate scripting rule 100D with the relevant cluster 113′. Scripting rule 100D may be regarded as a rule in a sense that it may associate an identified condition (e.g., appearance of an illegitimate data transferal action), as defined by the relevant cluster 113′, with a mitigating action. In this example, the mitigating action may include disallowing of execution of the illegitimate actions that are included in the relevant cluster 113′. - According to some embodiments, data
request analysis module 110 may be, or may include a monitoring module, adapted to monitor execution of SCEs 30B of the one or more ICDs 30A by the at least one cloud-based platform 50. - Additionally, or alternatively, data
request analysis module 110 may include a rule engine module 115, adapted to identify, during run-time or subsequent to execution, a problem in execution of SCE 30B, according to a set of predefined runtime rules 100C, based on the monitoring. - For example, data
request analysis module 110 may perform run-time monitoring of logs generated by the at least one cloud-based platform. In another example, data request analysis module 110 may perform run-time monitoring of an order of execution of the scripting code elements. In another example, data request analysis module 110 may perform run-time monitoring of output of the executed scripting code elements. In another example, data request analysis module 110 may monitor execution statistics such as duration of execution of at least one scripting code element of the one or more interactive computing documents. - In another example, data
request analysis module 110 may perform a comparison between output statistics of a current execution of an interactive computing document and output statistics of a previous execution of the interactive computing document. - For example, data
request analysis module 110 may be configured to identify at least one difference between an output of a specific SCE 30B at a first execution and an output of the specific SCE 30B (e.g., the same SCE 30B, or another SCE 30B containing similar code lines) at a second execution. - In another example, data
request analysis module 110 may be configured to identify at least one difference between a first metadata element (e.g., duration of execution, memory footprint, power consumption, etc.), associated with a specific scripting code element at a first execution, and a second, corresponding metadata element, associated with the specific scripting code element at a second execution. - Additionally, or alternatively, during run-time, data
request analysis module 110 may monitor run-time statistics and/or output of executed SCEs (e.g., cells) 30B. For example, data request analysis module 110 may record the processing time of each cell, such as code cells, markup (HTML) cells, and/or raw cells, as commonly referred to in the art. Rule engine module 115 of data request analysis module 110 may produce one or more run-time rules 100C pertaining to specific features or characteristics of the execution statistics and/or output of the execution. Pertaining to the same example, rule engine module 115 may produce a rule 100C that associates between a detected abnormal behavior (e.g., when execution exceeds a predetermined threshold period) and a mitigation action (e.g., revocation of data). - Data
request analysis module 110 may analyze the one or more extracted action features to identify suspicious or abnormal actions, such as an abnormal attempt to access data stored on computing environment 50. In a condition that such an abnormal or suspicious data action request 200B is identified, analysis module 110 may block the relevant data action request 200B (e.g., from accessing, reading, writing, deleting, etc. relevant data on computing environment 50). - Additionally, or alternatively, data
request analysis module 110 may produce a message or log that includes an indication of the identified abnormal data action (e.g., abnormal data access). Data request analysis module 110 may transmit this message to be stored on database 80. Additionally, or alternatively, data request analysis module 110 may communicate the message or log of abnormal behavior to an alerts module 160. - 
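The association between a detected run-time condition, a mitigation action and a resulting log message can be sketched as follows. This is an illustrative sketch under stated assumptions: the `RuntimeRule` shape, the `check_execution` function, the field names in the statistics dictionary and the 60-second threshold are all hypothetical and are not defined by the specification.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RuntimeRule:
    """Hypothetical shape: associates a detected condition with a mitigation action."""
    name: str
    condition: Callable[[dict], bool]
    mitigation: str


def check_execution(rules, stats) -> Optional[dict]:
    """Return a log record for the first rule triggered by `stats`, else None."""
    for rule in rules:
        if rule.condition(stats):
            return {
                "cell_id": stats["cell_id"],
                "rule": rule.name,
                "mitigation": rule.mitigation,
            }
    return None


# Assumed threshold: flag any cell that runs longer than 60 seconds.
rules = [
    RuntimeRule(
        name="execution_time_exceeded",
        condition=lambda s: s["duration_s"] > 60.0,
        mitigation="revoke_data",
    )
]

slow = check_execution(rules, {"cell_id": "cell-3", "duration_s": 95.0})
fast = check_execution(rules, {"cell_id": "cell-4", "duration_s": 2.5})
```

Here `slow` is a log record that could be stored or forwarded as an alert, while `fast` is `None` because no rule was triggered.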
Alerts module 160 may in turn send a notification (e.g., an email message) to relevant computing systems, e.g., via notification module 240 of the relevant client 200. Additionally, or alternatively, alerts module 160 may communicate the notification of abnormal data action to data revocation module 230 of the relevant client 200, to revoke or delete the relevant data from storage associated with the relevant development environment 30. - Additionally, or alternatively,
rule engine module 115 may collaborate with alerts module 160 to communicate a notification message 100F to a relevant client module 200. Message 100F may include a representation of the identified execution problem (e.g., rule 100C that has been triggered). Client 200 may subsequently display notification message 100F on a user interface (UI) of the user development platform 30. - According to some embodiments, data
request analysis module 110 may include a scanning service 114. Scanning service 114 may interact with, and scan in real time, incoming SCEs 30B (e.g., data requests) from clients 200, as well as historical requests, and logs of data access footprints and data actions stored on database 80. - Additionally, or alternatively,
rule engine module 115 may initiate or perform at least one mitigation action 100G (e.g., other than producing a notification message 100F). For example, rule engine module 115 may communicate with a relevant client 200 and/or cloud-based platform 50, to block a user of the specific client module from performing a data action (e.g., accessing data stored on the at least one cloud-based platform 50). In another example, rule engine module 115 may collaborate with data management module 140, and communicate with one or more organizational servers or databases 80, to change access permissions (e.g., permissions associated with a role of a user, permissions associated with a specific client module, etc.) to a specific cloud-based platform 50. In another example, rule engine module 115 may collaborate with tracking module 250 of a relevant client 200, to delete data footprints of the at least one executed scripting code element, as elaborated herein. In another example, rule engine module 115 may communicate with a relevant client 200 to disable a specific user of a specific client module 200 from executing SCEs 30B in their user development platform 30. Additional mitigation actions 100G are also possible. - As elaborated herein, data
request analysis module 110 of proxy server 100 may be adapted to monitor execution of the SCEs 30B of the one or more ICDs 30A by the at least one cloud-based platform 50. Additionally, or alternatively, data request analysis module 110 may be configured to produce a log data element, representing said monitored execution of the scripting code elements. Data request analysis module 110 may communicate the log data element to a relevant client module 200 (e.g., as a notification message 100F). Data request analysis module 110 may communicate the log data element in real time or near real time so as to facilitate the interactivity quality of ICD 30A during the development process. The relevant client module 200 may, in turn, display the log data element (e.g., notification message 100F) on a UI of the user development platform 30. - It may be appreciated that
proxy server 100 of the present invention may allow real-time, cell-level logging and monitoring of execution of actions on a cloud-based platform 50, as they are incrementally called in a development platform 30 on the client 200 side. - According to some embodiments, an
ICD 30A (e.g., a notebook) may be requested by a user (e.g., by development environment 30) to be executed or scheduled for execution on distributed computing environment (e.g., cloud platform) 50. Client 200 may subsequently upload or transmit ICD 30A to proxy server 100, either entirely (e.g., including all SCEs 30B of ICD 30A), or partially (e.g., including a subset of SCEs 30B of ICD 30A), according to a predefined configuration. During a preemptive scan, scanning service 114 may be configured to scan the received SCEs 30B (e.g., code lines), according to the set of predetermined scripting rules 100D as elaborated herein (e.g., in relation to FIG. 3). - Additionally, or alternatively,
scanning service 114 may be configured to scan a subset of the receivedSCEs 30B (e.g., code lines), according to a predetermined configuration. - Additionally, or alternatively, the one or more ICDs 30A may include, or be associated with document metadata, such as size (e.g., in Bytes) of the
ICD 30A, a type ofcells ICD 30B in theICD 30A (e.g., code cells, markup cells, raw cells, as commonly referred to in the art), an identification of an owner of theICD 30A, an identification and/or associated permissions of a user who is currently requesting execution of theICD 30A, etc.Scanning service 114 ofproxy server 100 may scanICD 30A scan the document metadata according to a set ofpredetermined scripting rules 100D, which may be referred to herein asmetadata rules 100D′, in this context.Such scripting rules 100D (e.g.,metadata rules 100D′) may include limitations to metadata of the scannedICD 30A, such as a limitation of size of the executedICD 30A, a limitation of permission to executeICD 30A, a limitation of execution of specific actions of code lines and/or types ofCSEs 30B, and the like. Other limitations may also be possible. As elaborated herein,proxy server 100 may then encapsulate the one or more ICDs 30A and/or theworkflow data element 210A in one ormore data containers 120A, based on the scan of document metadata, e.g., conditioned thatICD 30A metadata successfully complies to the limitations ofscripting rules 100D (e.g.,metadata rules 100D′). - Additionally, or alternatively, during a run-time scan,
scanning service 114 may be configured to monitor execution of one ormore SCEs 30B in the receivedICD 30A, to identify SCE-level (e.g., cell-level) exceptions in the execution of the received ICD (e.g., notebook) 30A, and/or malicious or suspicious actions, according toruntime rules 100C, as elaborated herein (e.g., in relation toFIG. 3 ). As elaborated herein,scanning service 114 may communicate withcode tracking module 250 during run time, to notify a user and/or perform mitigation actions, in response to the identified exceptions or suspicious actions. - As elaborated herein, one or more (e.g., each) problematic code line or action found by
rules 100D and/or 100C may be associated with a mitigation action. The mitigation action may be invoked by alert message 100F, and may include, for example, sending an email to an administrator user, returning a report to the user interactive development environment 30 using client 200, deleting specific data footprints, disabling execution of code lines in the interactive development environment 30 (either permanently or for a predefined period of time), and the like.
- According to some embodiments, proxy server 100 may include a data management module 140. Data management module 140 may be configured to manage permissions, restrictions and/or mitigation actions according to specific users and/or roles.
- For example, data management module 140 may be configured to set or change a mitigation action according to a user's identity. For example, data management module 140 may enable proxy server 100, subject to relevant permissions, to disallow execution of a suspected SCE 30B by a first user, and allow execution of the suspected SCE 30B by a second user, having elevated permissions.
- In another example, data management module 140 may allow dynamic adjustment of mitigation actions. For example, data management module 140 may enable proxy server 100 to run a specific ICD 30A (e.g., a notebook) if fewer than a predefined number of problematic actions (or actions with low severity) have been identified in the relevant ICD 30A.
- Additionally, or alternatively, a user of development environment 30 may utilize client 200 APIs to request manual analysis/scan of an ICD 30A (e.g., notebook), using these modules to get a personal report.
- Additionally, or alternatively,
scanning service 114 may be configured to add web features that are extracted by data enhancement module 150, to determine or identify a data action (e.g., a data access) request 200B that is suspected to be a security breach. Scanning service 114 may implement anomaly detection and user behavior analysis to determine a risk and send messages on such cases to data access module 120, as elaborated herein.
- According to some embodiments, scanning service 114 may be configured to check data validation in case of data actions such as a data write access or a data delete access, to avoid malicious or erroneous changes in a dataset stored on environment 50.
- For example, scanning service 114 may determine that a delete data access is valid by comparing or analyzing extracted features (e.g., timestamps, dataset names, execution-code, etc.) of a current data action request 200B in view of extracted features of historical delete data access sequences. Scanning service 114 may compare such extracted features of a predefined number of actions before and/or after deletion, corresponding to current and historical delete data action requests 200B, to identify an anomalous deletion attempt.
- In a similar example, scanning service 114 may determine validity of a write/replace data access by comparing extracted features of the new and old datasets, such as column names and the value distribution and range per column, to identify a suspicious or anomalous write data access attempt, and may determine accordingly whether to block the action and/or notify users.
- As shown in
FIG. 4, at least one proxy server 100 may include a data access module 120, adapted to receive one or more ICDs 30A and/or at least one workflow data element 210A from a client 200, and encapsulate the workflow data element and the one or more ICDs 30A in one or more data containers 120A, based on the preemptive scan (e.g., provided that the preemptive scan was successful).
- In other words, if a preemptive scan of an ICD (e.g., notebook) 30A is successful, in the sense that data request analysis module 110 did not identify the one or more notebooks 30A (and/or data action requests 200B included in the one or more notebooks 30A) as being suspicious or abnormal, then data access module 120 may encapsulate the relevant one or more notebooks 30A in a data container 120A. If, on the other hand, a preemptive scan of at least one notebook 30A fails, in the sense that data request analysis module 110 did identify the at least one notebook 30A as suspicious or abnormal, then data access module 120 may refrain from encapsulating the relevant one or more notebooks 30A in a data container 120A, and thus block any data action request 200B included in the one or more notebooks 30A.
- Additionally, or alternatively, data request analysis module 110 may block execution of one or more ICDs 30A (e.g., notebooks) based on the outcome of the preemptive scan, by disallowing transmission of ICDs 30A (e.g., whether encapsulated in data container 120A or not) to cloud computing platform 510.
- As known in the art, the term “container” (also commonly referred to as a “docker”) may be used herein to refer to a standard unit or package of software that may include computer code (e.g., source code included in cells 30B) and relevant code dependencies required for execution of that computer code. Such dependencies may include, for example, references to specific resources (e.g., specific elements of distributed computing environment 50), databases (e.g., database 80), configuration of distributed processing systems (e.g., kernels and memory of distributed processing systems 520), configuration of cloud services (e.g., allocated storage in storage services 531), and the like.
- It may be appreciated that such encapsulation of source code (e.g., source code included in cells 30B) into a container or docker 120A may allow an application to be quickly and reliably executed, from one computing environment to another (e.g., from development environment 30 to distributed computing environment 50).
- Additionally, or alternatively,
data access module 120 may receive (e.g., from a user of development environment 30, via client 200) one or more configurations or adjustments to container 120A (e.g., selection of processing units, selection of memory, etc.), and may adapt container 120A accordingly.
- For example, as known in the art, execution parameters may be used in computing languages to allow a user to provide arbitrary configuration information and/or set variable values for executing computer code. According to some embodiments, data access module 120 may receive, from the at least one client module, at least one execution parameter value associated with a specific ICD 30A, and may inject the received execution parameter value as metadata into the relevant ICD 30A. Data access module 120 may then encapsulate the document metadata (e.g., including the injected execution parameter value) into the one or more data containers 120A. Data access module 120 may subsequently transmit or propagate (e.g., via the Internet) data containers 120A to the at least one cloud-based platform 50, which may, in turn, execute at least one SCE 30B of the one or more ICDs according to the document metadata (e.g., according to the injected execution parameter value).
- According to some embodiments, data access module 120 may be configured to transmit the one or more data containers 120A to at least one cloud-based platform (e.g., distributed computing environment 50), to execute the one or more ICDs 30A.
- Additionally, or alternatively, data access module 120 may be configured to transmit the one or more data containers and/or the workflow data element to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the execution scheme.
- As elaborated herein, at least one
cloud computing platform 510 of distributed computing environment 50 may be configured to execute, or run, the one or more ICDs 30A encapsulated in data containers 120A according to the execution scheme defined by the workflow data element 210A, as elaborated herein (e.g., in relation to FIG. 3).
- According to some embodiments, proxy server 100 may be configured to encapsulate workflow data element 210A with the one or more ICDs 30A in the one or more data containers 120A. Additionally, or alternatively, proxy server 100 may be configured to transmit workflow data element 210A separately from the one or more data containers 120A of ICDs 30A (e.g., as separate data files) to cloud computing platform 510 of distributed computing environment 50.
- According to some embodiments, data access module 120 may be configured to encapsulate workflow data element 210A separately from the one or more ICDs 30A. For example, data access module 120 may encapsulate workflow data element 210A in a first data container 120A, and then spawn or generate one or more second data containers 120A from the first data container 120A. The second data containers 120A may be characterized in that (a) each second data container 120A may inherit configurations and/or definitions of the first data container 120A, and (b) one or more (e.g., each) container 120A of the second data containers 120A may encapsulate one or more ICDs 30A of the one or more received ICDs 30A, based on the workflow data element. In other words, the spawning of the one or more second data containers 120A may manifest the execution scheme (e.g., order of execution, dependency of execution, parallelism of execution) of the relevant ICDs 30A.
- Data access module 120 may then transmit the one or more second data containers 120A to the at least one cloud-based platform 50 according to the execution scheme, as defined by the workflow data element 210A. Additionally, or alternatively, the at least one cloud-based platform 50 may be configured to execute the one or more relevant ICDs 30A, encapsulated in their respective containers 120A, according to the execution scheme, as defined by the workflow data element 210A.
- As shown in
FIG. 4, at least one proxy server 100 may include a data enhancement module 150. According to some embodiments, data enhancement module 150 may be, or may include, a data mining engine, adapted to extract data from various data sources and databases, including for example online data sources, local and/or cloud-based databases, and/or rules and terms that are added manually (e.g., via input 7 of FIG. 2).
- According to some embodiments, data enhancement module 150 may mine for data such as machine details per cloud provider/platform/solution and their cost, as well as machine-learning and data processing code snippets from code repositories (e.g., GitHub, blogs, StackOverflow.com). Data enhancement module 150 may provide the mined data to a data access optimization module 130 to predict the best offer to process/retrieve/write data, as elaborated herein.
- Additionally, or alternatively, data enhancement module 150 may also extract or mine relevant data for the data request analysis module 110 for the prediction of a security risk or mistake.
- According to some embodiments,
data enhancement module 150 may, as part of the data mining process, query security engines/tools/APIs/knowledge bases (e.g., exploitation/vulnerability databases, https://attack.mitre.org/, the DarkReading website, open source lists of security detection rules) as well as code repositories and blogs on specific data breach cases (e.g., in https://stackoverflow.com/, https://github.com/) and cloud platform documentation of data retrieval/processing/write, to get a reference for valid code snippets. Data enhancement module 150 may then aggregate the mined data into a relevant list of bad/good terms for recommendations for the data request analysis module 110 to determine/predict a security/data-breach/data-exfiltration issue. Data enhancement module 150 may also build a set of recommendations per historical detected potential data breach/exfiltration using the extracted data and log of actions.
- As shown in FIG. 4, at least one proxy server 100 may include a data management module 140, configured to enable addition, updating and/or deletion of entities such as users, roles, policies, security-rules, recommendations, data-services, clusters, schedulers, code-scripts (on every cloud platform) and datasets and dataset actions, while managing connection to customer users-databases such as Lightweight Directory Access Protocol (LDAP), Active Directory (AD), Gsuite and the like. Data management module 140 may include a configuration required to connect to all users, cloud providers, data services, and data sets in order to send commands and retrieve/delete/write data from/to environment 50.
- Additionally, or alternatively, data management module 140 may enable a user (e.g., via input 7 of FIG. 2) to define an automatic processing machine/machines-cluster/job/notebook/scripts scheduler, that will interact with the data access optimization module 130, as elaborated herein.
- Retrieval of data may be done by analyzing the command(s) and comparing them to all the historical commands that were sent during a cluster connection session, to determine whether there is a need to send the request again or simply to retrieve the last result that was already cached on a cloud storage, while verifying the timestamp of the result against user parameters or a force-retrieve parameter.
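- As a non-limiting illustration only, the retrieval decision described above may be sketched as follows. The sketch assumes exact matching of normalized command text and a user-supplied maximum result age; the names `CachedResult`, `normalize`, `lookup_cached_result`, `max_age_seconds` and `force_retrieve` are hypothetical and do not appear in the embodiments, and lowercasing a command is a simplification that may not suit case-sensitive commands.

```python
from dataclasses import dataclass
from typing import List, Optional
import time


@dataclass
class CachedResult:
    command: str          # normalized command text that produced the result
    result_location: str  # hypothetical pointer to the result cached on cloud storage
    created_at: float     # UNIX timestamp at which the result was cached


def normalize(command: str) -> str:
    """Collapse whitespace and case so trivially different commands match."""
    return " ".join(command.lower().split())


def lookup_cached_result(
    command: str,
    session_history: List[CachedResult],
    max_age_seconds: float,
    force_retrieve: bool = False,
    now: Optional[float] = None,
) -> Optional[CachedResult]:
    """Return a reusable cached result, or None if the request must be sent again."""
    if force_retrieve:
        # The user explicitly asked to bypass the cache.
        return None
    now = time.time() if now is None else now
    key = normalize(command)
    # Walk the session history from the most recent command to the oldest.
    for entry in reversed(session_history):
        is_same_command = entry.command == key
        is_fresh = (now - entry.created_at) <= max_age_seconds
        if is_same_command and is_fresh:
            return entry
    return None
```

A `None` return value corresponds to sending the request again; any other return value corresponds to retrieving the last cached result. A similarity-based comparison could replace the exact match used in this sketch.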
- According to some embodiments, data access optimization module 130 may be configured to further analyze the data request or execution request that was analyzed by data request analysis module 110. In some embodiments, data access optimization module 130 may split a received data action request 200B into terms such as actions, codes, file-parameters and the like. Data access optimization module 130 may analyze these terms in comparison to, or in view of, similar historical terms (e.g., terms included in previous data action requests 200B). Such similarity (e.g., similarity metric value 117) may be calculated, for example, by algorithms such as Term Frequency-Inverse Document Frequency (TF-IDF), cosine similarity, ML-based models, and the like. According to some embodiments, based on this analysis or comparison, data access optimization module 130 may determine whether to retrieve the last action's results (e.g., operate as a cache) or to re-process the action.
- Additionally, or alternatively, data access optimization module 130 may calculate (e.g., via a regression model) the amount of data requested in case of data retrieval (or simply the data size in case of write/delete) and determine, using a prediction model calculated beforehand (e.g., running every predetermined interval), the cheapest and/or fastest code or API and platform or solution to run the action. Data access optimization module 130 may choose the best options automatically, according to user definitions in data management module 140 (automatic/manual changes to the auto process), while deciding whether to create new data-cluster(s) or alter the cluster machine types or number of machines (possibly on various cloud providers) to process the data, and/or to terminate/halt currently active clusters/services on different cloud platforms/services, by calculating the optimal time for the cluster to work before termination and the size of the initial cluster on creation.
- Additionally, or alternatively, data access optimization module 130 may calculate a computing machine type and/or a required number of computing machines by employing a machine learning model and statistical regressions, extracting relevant terms from the action's parameters (which can be a code snippet, a code file or a notebook file) and gathering the relevant parameters logged from the client 200 code run module 210 (such as ‘CUDA’, ‘GPU’, ‘TensorFlow’, ‘Keras’, “Memory failure”), as well as historical action results and processing time.
- Reference is now made to
FIG. 5, which is a flow diagram depicting a method of providing an interactive development platform in a distributed computing environment 50, according to some embodiments of the invention.
- As shown in steps S1005 and S1010, embodiments of the invention may provide a proxy server module (e.g., proxy server 100 of FIG. 4), adapted to interface at least one cloud-based platform (e.g., distributed computing environment 50 of FIG. 4), and one or more client modules (e.g., client(s) 200 of FIG. 4), operatively connected to proxy server 100, where each client module 200 is associated with a respective user development platform (e.g., development environment 30 of FIG. 3) such as Amazon's Sagemaker.
- As shown in step S1015, at least one client module 200 may receive, from the respective user development platform 30, one or more ICDs 30A. ICDs 30A are also referred to herein as notebooks, such as Amazon Sagemaker notebook documents. The terms ICD and notebook may be used herein interchangeably. As known in the art, each ICD or notebook 30A may include, or represent, one or more SCEs 30B, such as notebook cells, as commonly referred to in the art. The terms SCE, cells and code lines may be used herein interchangeably.
- Additionally, or alternatively, the at least one client module 200 may produce a workflow data element (e.g., element 210A of FIG. 4), that may define a scheme for executing the one or more ICDs, as elaborated herein (e.g., in relation to FIG. 4).
- According to some embodiments, the at least one
client module 200 may transmit (e.g., via the Internet) the one or more ICDs 30A and/or the one or more included SCEs 30B and/or the workflow data element 210A to proxy server 100. As shown in step S1020, proxy server 100 may scan the one or more SCEs 30B according to a set of rules. For example, proxy server 100 may perform a preemptive scan of the one or more SCEs 30B according to a set of predetermined scripting rules 100D (e.g., element 100D of FIG. 3), as elaborated herein (e.g., in relation to FIG. 4).
- As shown in step S1025, proxy server 100 may encapsulate the one or more interactive computing documents in one or more data containers (e.g., element 120A of FIG. 4), based on the scan (e.g., provided that the scan was successful), as elaborated herein (e.g., in relation to FIG. 4). Additionally, or alternatively, proxy server 100 may encapsulate workflow data element 210A with the one or more interactive computing documents in the one or more data containers 120A.
- As shown in step S1030, proxy server 100 may subsequently transmit the one or more data containers to the at least one cloud-based platform 50. Cloud-based platform 50 may be configured to execute at least one SCE 30B of the one or more ICDs 30A encapsulated in the one or more containers 120A. Additionally, or alternatively, proxy server 100 may transmit the one or more data containers to the at least one cloud-based platform, to execute the at least one SCE 30B of the one or more notebooks, according to the execution scheme, as defined by the workflow data element 210A.
- As elaborated herein, embodiments of the invention may include a practical application for improving the technology of a computing system.
- For example, as explained herein (e.g., in relation to FIG. 1), proxy server 100 and the one or more client computing devices 200 may be configured to streamline a development process that currently requires iterative collaboration between a plurality of disciplines and stakeholders such as research engineers, data scientists, data engineers and automation engineers.
- Additionally, embodiments of the invention may provide this streamlined workflow in a manner that maintains data security and requires optimal computation resources.
- For example, embodiments of the invention may enable following datasets that are transferred to a development environment, and revoking or limiting their use, while understanding source code and variables.
- In another example, embodiments of the invention may track transfer of data requests 200B between client 200 and proxy server 100, locally or remotely, and learn malicious data behaviors that may use the retrieved data and transfer/save/write/replace/process it, in order to delete data footprints in case of a data security incident (caused either by a malicious user or by mistake). Additionally, embodiments of the invention may employ rules, statistics and ML models to alert on a problematic or suspicious data-related behavior and block suspicious data-related actions.
- In another example, embodiments of the invention may provide a multiple-notebook development environment that may facilitate integration of notebooks in a workflow, to follow and understand data access behavior and usage, while enabling scheduled tasks to be monitored and optimizing execution of data/machine-learning relevant commands/jobs/scripts.
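- As a non-limiting illustration only, one way in which notebooks integrated in a workflow might be ordered for execution is sketched below. The sketch assumes that the workflow is expressed as a mapping from each notebook to the notebooks it depends on, and uses the `graphlib` module of the Python standard library (Python 3.9 or later). The function name `execution_batches` and the notebook file names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group notebooks into batches that respect dependency order.

    Notebooks in the same batch do not depend on one another, so they
    could be executed concurrently. A cyclic workflow raises
    graphlib.CycleError when the sorter is prepared.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # All notebooks whose dependencies have already been executed.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    # Hypothetical workflow: each key depends on the notebooks in its set.
    workflow = {
        "clean.ipynb": set(),
        "stats.ipynb": {"clean.ipynb"},
        "train.ipynb": {"clean.ipynb"},
        "report.ipynb": {"stats.ipynb", "train.ipynb"},
    }
    for index, batch in enumerate(execution_batches(workflow), start=1):
        print(index, batch)
```

Each batch corresponds to notebooks that may run in parallel, and the sequence of batches corresponds to an order of execution; scheduling and conditional execution are outside the scope of this sketch.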
- In another example, embodiments of the invention may optimize data action requests 200B (e.g., data access requests) by analyzing the requests 200B, determining the amount of data requested, activating the optimal request for the optimal service, and choosing the optimal machine type(s) and number, in terms of cost and speed, in order to get/write/delete the data or run the data-related source code in the fastest and/or cheapest combination. This may require automatically creating a new big data cluster or terminating other clusters on different cloud platforms, so as to get the data in a scalable manner, by manually or automatically defining the number of data-nodes (size of cluster) according to the request, and the optimal active time for each cluster on each cloud platform for optimal processing.
- In another example, embodiments of the invention may provide recommendations on how to improve performance or harden data security by extracting relevant web-related knowledge bases and code snippets after identifying potential data exfiltration, while accessing all data sources, using various data services and data-related actions, and using the development environment to investigate data.
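- As a non-limiting illustration only, choosing between candidate clusters in terms of cost and speed may be sketched as follows. The sketch assumes that a predicted processing time per candidate is already available (e.g., from a previously trained prediction model, which is not shown). The names `ClusterOption` and `choose_option`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClusterOption:
    platform: str                   # hypothetical cloud platform label
    machine_type: str               # hypothetical machine type label
    machine_count: int              # number of machines in the cluster
    hourly_cost_per_machine: float  # assumed price per machine-hour
    predicted_hours: float          # assumed output of a prediction model

    @property
    def total_cost(self) -> float:
        """Predicted cost of running the whole cluster for the action."""
        return self.machine_count * self.hourly_cost_per_machine * self.predicted_hours


def choose_option(options: Sequence[ClusterOption], prefer: str = "cheapest") -> ClusterOption:
    """Pick the cheapest or the fastest option; the other criterion breaks ties."""
    if not options:
        raise ValueError("at least one cluster option is required")
    if prefer == "cheapest":
        return min(options, key=lambda o: (o.total_cost, o.predicted_hours))
    if prefer == "fastest":
        return min(options, key=lambda o: (o.predicted_hours, o.total_cost))
    raise ValueError("prefer must be 'cheapest' or 'fastest'")


if __name__ == "__main__":
    candidates = [
        ClusterOption("platform-a", "small", 2, 1.0, 4.0),
        ClusterOption("platform-b", "small", 8, 1.0, 1.5),
    ]
    print(choose_option(candidates, prefer="cheapest").platform)
    print(choose_option(candidates, prefer="fastest").platform)
```

The `prefer` argument stands in for a user definition of the kind managed by data management module 140; a weighted combination of cost and time could be used instead of a strict preference.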
- Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Furthermore, all formulas described herein are intended as examples only and other or different formulas may be used. Additionally, some of the described method embodiments or elements thereof may occur or be performed at the same point in time.
- While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents may occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
- Various embodiments have been presented. Each of these embodiments may of course include features from other embodiments presented, and embodiments not specifically described may include various features described herein.
Claims (20)
1. A method of providing an interactive development environment, said method comprising:
providing a proxy server configured to:
operatively connect to one or more client modules, each associated with a respective user development platform;
receive, from at least one client module of the one or more client modules, one or more interactive computing documents, each comprising one or more scripting code elements;
scan the one or more scripting code elements according to a set of predetermined scripting rules;
encapsulate the one or more interactive computing documents in one or more data containers, based on said scan; and
transmit the one or more data containers to at least one cloud-based platform, wherein the at least one cloud-based platform is configured to execute at least one scripting code element of the one or more interactive computing documents.
2. The method of claim 1, wherein the proxy server is further configured to:
monitor execution of the one or more scripting code elements of the one or more interactive computing documents by the at least one cloud-based platform; and
perform at least one mitigation action based on said monitoring, wherein said mitigation action is selected from a list consisting of: blocking a user of a specific client module from accessing the at least one cloud-based platform, changing an access permission of a role of the user of the specific client module to a specific cloud-based platform, deleting data footprints of the at least one scripting code element, and disabling a specific user of the specific client module from executing the one or more scripting code elements in the user development platform.
3. The method of claim 1, further comprising:
producing, by at least one client module of the one or more client modules, a workflow data element, defining a scheme for executing the one or more interactive computing documents;
transmitting, by at least one client module, the workflow data element to the proxy server; and
transmitting, by the proxy server, the one or more data containers and the workflow data element to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the scheme, as defined by the workflow data element.
4. A proxy server for providing an interactive development environment, said proxy server configured to:
receive, from at least one client module associated with a respective user development platform, one or more interactive computing documents, each comprising one or more scripting code elements;
accumulate a subset of scripting code elements, each representing one or more data actions;
classify the subset of scripting code elements, in relation to the at least one client module, to one or more classifications;
determine at least one scripting rule based on the one or more classifications; and
scan at least one scripting code element according to the at least one scripting rule.
5. The proxy server of claim 4, further configured to:
receive, from the at least one client module, a workflow data element, defining a scheme for executing the one or more interactive computing documents;
encapsulate the workflow data element with the one or more interactive computing documents in the one or more data containers; and
transmit the one or more data containers to at least one cloud-based platform, to execute the at least one scripting code element of the one or more interactive computing documents according to the scheme, as defined by the workflow data element.
6. The proxy server of claim 5, wherein the scheme comprises at least one of: (a) scheduling of execution of the one or more interactive computing documents; (b) an order of execution of the one or more interactive computing documents; (c) a condition for executing the one or more interactive computing documents; and (d) concurrence of execution of two or more interactive computing documents.
7. The proxy server of claim 5, wherein the one or more interactive computing documents further comprise document metadata, and wherein the proxy server is further configured to:
scan the document metadata according to a set of predetermined metadata rules; and
encapsulate the workflow data element and the one or more interactive computing documents in the one or more data containers, based on said scan of the document metadata.
8. The proxy server of claim 5, further configured to:
receive, from the at least one client module, at least one execution parameter value, associated with a specific interactive computing document;
inject the received execution parameter value as document metadata into the specific interactive computing document; and
further encapsulate document metadata in the one or more data containers, wherein the at least one cloud-based platform is further configured to execute the at least one scripting code element of the one or more interactive computing documents according to the document metadata.
9. The proxy server of claim 8 further comprising:
a feature extraction module, configured to extract one or more data action features from at least one specific scripting code element of the subset of scripting code elements; and
an ML based model configured to classify the specific scripting code element according to the one or more classifications, based on the one or more data action features.
10. The proxy server of claim 5, further configured to:
encapsulate the workflow data element and the one or more interactive computing documents by (i) encapsulating the workflow data element in a first data container, and (ii) spawning one or more second data containers, each encapsulating one or more interactive computing documents of the one or more interactive computing documents based on the workflow data element; and
transmit the one or more second data containers to the at least one cloud-based platform according to the scheme, as defined by the workflow data element.
11. The proxy server of claim 5 further comprising:
a monitoring module, adapted to monitor execution of the one or more scripting code elements of the one or more interactive computing documents by the at least one cloud-based platform; and
a rule engine module, adapted to identify a problem in the execution according to a set of predefined runtime rules, based on said monitoring.
12. The proxy server of claim 10, wherein the rule engine module is further configured to communicate a notification message to a relevant client module, wherein the notification message comprises a representation of the identified execution problem, to be displayed on a user interface (UI) of the user development platform.
13. The proxy server of claim 10, wherein the rule engine module is further configured to perform at least one mitigation action, selected from a list consisting of: blocking a user of a specific client module from accessing data stored on the at least one cloud-based platform, changing an access permission of a role of the user of the specific client module to a specific cloud-based platform, deleting data footprints of the at least one scripting code element, and disabling a specific user of the specific client module from executing the one or more scripting code elements in the user development platform.
14. The proxy server of claim 10, wherein monitoring execution of the one or more scripting code elements comprises at least one of: run-time monitoring of logs generated by the at least one cloud-based platform; run-time monitoring of an order of the execution of the one or more scripting code elements; run-time monitoring of output of the execution of the one or more scripting code elements; monitoring duration of execution of the at least one scripting code element of the one or more interactive computing documents; and performing a comparison between output statistics of a current execution of an interactive computing document and output statistics of a previous execution of the interactive computing document.
15. The proxy server of claim 10, wherein monitoring execution of the one or more scripting code elements comprises at least one of:
identifying at least one difference between an output of a specific scripting code element at a first execution and an output of the specific scripting code element at a second execution; and
identifying at least one difference between a first metadata element, associated with the specific scripting code element at the first execution and a second metadata element, associated with the specific scripting code element at the second execution.
16. The proxy server of claim 5 further comprising a monitoring module, adapted to:
monitor execution of the one or more scripting code elements of the one or more interactive computing documents by the at least one cloud-based platform;
produce a log data element, representing said monitored execution of the one or more scripting code elements; and
communicate the log data element to a relevant client module, wherein said relevant client module is configured to display said log data element on a UI of the user development platform.
17. The proxy server of claim 4, wherein the proxy server is further configured to scan the at least one scripting code element according to the set of predetermined scripting rules, to verify at least one of (a) permission-based legitimacy of data retrieval by the at least one scripting code element, and (b) lack of malicious activity in the at least one scripting code element.
18. A proxy server for providing an interactive development environment, said proxy server configured to:
receive, from a client module, one or more interactive computing documents, each comprising one or more scripting code elements;
apply a machine learning (ML) based model on the one or more scripting code elements to determine at least one scripting rule;
scan the one or more scripting code elements according to the at least one scripting rule;
encapsulate the one or more interactive computing documents in one or more data containers, based on said scan; and
transmit the one or more data containers to at least one cloud-based platform, to execute at least one scripting code element of the one or more interactive computing documents.
19. The proxy server of claim 18, further configured to:
monitor execution of the one or more scripting code elements of the one or more interactive computing documents by the at least one cloud-based platform; and
perform at least one mitigation action based on said monitoring, wherein said mitigation action is selected from a list consisting of: blocking a user of a specific client module from accessing the at least one cloud-based platform, changing an access permission of a role of the user of the specific client module to a specific cloud-based platform, deleting data footprints of the at least one scripting code element, and disabling a specific user of the specific client module from executing the one or more scripting code elements in the user development platform.
20. The proxy server of claim 18, further configured to:
receive, from at least one client module of the one or more client modules, a workflow data element, defining a scheme for executing the one or more interactive computing documents; and
transmit the one or more data containers and the workflow data element to the at least one cloud-based platform, to execute the one or more interactive computing documents according to the scheme, as defined by the workflow data element.
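The monitoring recited in claims 11, 14 and 15 can be illustrated with a minimal sketch. It is not taken from the specification: the record layout (`CellExecution`), the function names, and the single duration rule are all hypothetical and serve only to show how two executions of the same scripting code element might be compared and checked against a predefined runtime rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CellExecution:
    """One run of a scripting code element (hypothetical record layout)."""
    cell_id: str
    output: str
    duration_s: float
    metadata: dict = field(default_factory=dict)


def diff_executions(first: CellExecution, second: CellExecution) -> list[str]:
    """List the differences between two runs of the same code element."""
    if first.cell_id != second.cell_id:
        raise ValueError("executions belong to different code elements")
    findings = []
    if first.output != second.output:
        findings.append("output changed")
    # Compare metadata key by key so each finding names the changed field.
    for key in sorted(first.metadata.keys() | second.metadata.keys()):
        if first.metadata.get(key) != second.metadata.get(key):
            findings.append(f"metadata '{key}' changed")
    return findings


def check_runtime_rules(run: CellExecution, max_duration_s: float) -> list[str]:
    """Apply one illustrative runtime rule (assumed): a duration limit."""
    problems = []
    if run.duration_s > max_duration_s:
        problems.append(
            f"{run.cell_id}: ran {run.duration_s:.1f}s, limit {max_duration_s:.1f}s"
        )
    return problems


# Two executions of the same code element with invented example values.
first_run = CellExecution("cell-3", "rows=100", 1.2, {"user": "alice", "table": "sales"})
second_run = CellExecution("cell-3", "rows=250", 9.5, {"user": "alice", "table": "customers"})

differences = diff_executions(first_run, second_run)
problems = check_runtime_rules(second_run, max_duration_s=5.0)
print(differences)
print(problems)
```

In this sketch, a non-empty `problems` list is what a rule engine might turn into a notification message or a mitigation action. How the claimed modules actually represent executions and rules is not specified here.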
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/982,981 US20230054226A1 (en) | 2021-08-22 | 2022-11-08 | System and method of providing an interactive development platform in a distributed computing environment |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163235747P | 2021-08-22 | 2021-08-22 | |
US17/728,715 US11513772B1 (en) | 2021-08-22 | 2022-04-25 | System and method of providing an interactive development platform in a distributed computing environment |
US17/982,981 US20230054226A1 (en) | 2021-08-22 | 2022-11-08 | System and method of providing an interactive development platform in a distributed computing environment |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/728,715 Continuation US11513772B1 (en) | 2021-08-22 | 2022-04-25 | System and method of providing an interactive development platform in a distributed computing environment |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230054226A1 (en) | 2023-02-23 |
Family
ID=84230846
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/728,715 Active US11513772B1 (en) | 2021-08-22 | 2022-04-25 | System and method of providing an interactive development platform in a distributed computing environment |
US17/982,981 Abandoned US20230054226A1 (en) | 2021-08-22 | 2022-11-08 | System and method of providing an interactive development platform in a distributed computing environment |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/728,715 Active US11513772B1 (en) | 2021-08-22 | 2022-04-25 | System and method of providing an interactive development platform in a distributed computing environment |
Country Status (1)
Country | Link |
---|---|
US (2) | US11513772B1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230328037A1 (en) * | 2022-04-07 | 2023-10-12 | Oracle International Corporation | Cloud Based Machine Learning Notebook Data Loss Prevention |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190281052A1 (en) * | 2018-03-08 | 2019-09-12 | Auton, Inc. | Systems and methods for securing an automotive controller network |
CN110839007A (en) * | 2018-08-17 | 2020-02-25 | 中国移动通信有限公司研究院 | Cloud network security processing method and device and computer storage medium |
US20210029089A1 (en) * | 2019-07-23 | 2021-01-28 | Microsoft Technology Licensing, Llc | Enforcing security policies on client-side generated content in cloud application communications |
US20210109841A1 (en) * | 2019-10-15 | 2021-04-15 | International Business Machines Corporation | Application containerization based on trace information |
US20220164167A1 (en) * | 2020-11-24 | 2022-05-26 | Kinaxis Inc. | Systems and methods for embedding a computational notebook |
US20220318049A1 (en) * | 2021-03-30 | 2022-10-06 | International Business Machines Corporation | Program context migration |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8898324B2 (en) | 2010-06-24 | 2014-11-25 | International Business Machines Corporation | Data access management in a hybrid memory server |
US8935804B1 (en) | 2011-12-15 | 2015-01-13 | United Services Automobile Association (Usaa) | Rules-based data access systems and methods |
US10360565B2 (en) * | 2012-05-18 | 2019-07-23 | Kofax, Inc. | System and method for providing a universal endpoint address schema to route documents and manage document workflows |
US9053329B2 (en) | 2012-05-24 | 2015-06-09 | Lockbox Llc | Systems and methods for validated secure data access |
US20140173618A1 (en) | 2012-10-14 | 2014-06-19 | Xplenty Ltd. | System and method for management of big data sets |
US9760602B1 (en) | 2014-10-29 | 2017-09-12 | Databricks Inc. | System for exploring data in a database |
US9769032B1 (en) | 2015-03-20 | 2017-09-19 | Databricks Inc. | Cluster instance management system |
US9699205B2 (en) | 2015-08-31 | 2017-07-04 | Splunk Inc. | Network security system |
US9935772B1 (en) | 2016-02-19 | 2018-04-03 | Vijay K Madisetti | Methods and systems for operating secure digital management aware applications |
US11019101B2 (en) | 2016-03-11 | 2021-05-25 | Netskope, Inc. | Middle ware security layer for cloud computing services |
US20180191759A1 (en) | 2017-01-04 | 2018-07-05 | American Express Travel Related Services Company, Inc. | Systems and methods for modeling and monitoring data access behavior |
US10474501B2 (en) | 2017-04-28 | 2019-11-12 | Databricks Inc. | Serverless execution of code using cluster resources |
WO2020097522A1 (en) * | 2018-11-08 | 2020-05-14 | airSlate Inc. | Automated electronic document workflows |
Non-Patent Citations (1)
Title |
---|
Translated CN 110839007-A (Year: 2020) * |
Also Published As
Publication number | Publication date |
---|---|
US11513772B1 (en) | 2022-11-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11924230B2 (en) | Individual device response options from the monitoring of multiple devices | |
AU2016204068B2 (en) | Data acceleration | |
US11184380B2 (en) | Security weakness and infiltration detection and repair in obfuscated website content | |
US10681060B2 (en) | Computer-implemented method for determining computer system security threats, security operations center system and computer program product | |
JP2020510926A (en) | Intelligent security management | |
EP3262815A1 (en) | System and method for securing an enterprise computing environment | |
CN117321584A (en) | Processing management of high data I/O ratio modules | |
US11762689B2 (en) | Message queue protocol for sequential execution of related task routines in many task computing | |
Xiao et al. | Detection and prevention of code injection attacks on HTML5-based apps | |
US20230054226A1 (en) | System and method of providing an interactive development platform in a distributed computing environment | |
Ouared et al. | Deepcm: Deep neural networks to improve accuracy prediction of database cost models | |
US20230334160A1 (en) | Systems and methods for software security analysis | |
US20220366056A1 (en) | Computer security using zero-trust principles and artificial intelligence for source code | |
Zhang et al. | Device-centric federated analytics at ease | |
US20210209227A1 (en) | System and method for defending applications invoking anonymous functions | |
US20230018068A1 (en) | Methods and systems for locating anomalous query activity on data stores | |
US11956245B1 (en) | Intelligent data protection | |
US11973784B1 (en) | Natural language interface for an anomaly detection framework | |
Shi et al. | Security-based code smell definition, detection, and impact quantification in Android | |
US20230044695A1 (en) | System and method for a scalable dynamic anomaly detector | |
Kulhavy | Efficient Collection and Processing of Cyber Threat Intelligence from Partner Feeds | |
Singh | Improving the Processing of Large Volume of Data Using Hadoop | |
WO2022215077A1 (en) | System and method for identifying cyber-attacks on a computing device of a protected network | |
Geistler | Automatically Detecting Malicious GitHub Actions | |
Gao | Effective and Efficient Forensic Analysis via System Monitoring |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: DATAPLATE LTD., ISRAEL; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: GROSS, AMIT; REEL/FRAME: 061694/0161; Effective date: 20220420 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |