WO2023028996A1

WO2023028996A1 - Methods and devices for ensuring the reproducibility of software systems

Info

Publication number: WO2023028996A1
Application number: PCT/CN2021/116475
Authority: WO
Inventors: Boyuan Chen; Mingzhi WEN; Yong Shi; Zhenming JIANG
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2021-09-03
Filing date: 2021-09-03
Publication date: 2023-03-09

Abstract

Methods and devices are described for validating reproducibility of a machine learning model. Upon the initiation of a first machine learning model training instance, a training profile containing random values used in the first machine learning training instance is obtained. Upon initiation of a second machine learning model training instance, an interception module is used to automatically intercept non-determinism introducing functions called during the training process, retrieve corresponding random values from the training profile and use the retrieved random values in the second machine learning model training instance. In response to validating that the results returned from the first and second machine learning model training instances are identical, the machine learning model is validated as a reproducible machine learning model.

Description

METHODS AND DEVICES FOR ENSURING THE REPRODUCIBILITY OF SOFTWARE SYSTEMS

TECHNICAL FIELD

The present disclosure relates to software systems, in particular, methods and devices for ensuring the reproducibility of software systems, especially of machine learning systems.

BACKGROUND

Machine learning is a branch of artificial intelligence that is widely adopted to solve problems in domains ranging from facial recognition, speech recognition, medical diagnostics and autonomous cars, among others. Using statistical techniques, machine learning algorithms learn from training datasets to find patterns in data and make predictions based on learned models. Model performance can be evaluated using a range of metrics, including accuracy, precision and recall, among others and is strongly dependent on the input training data and system setup.

Machine learning model performance continues to improve, with development efforts focusing on advances in algorithms, training data and tuning parameters. It is important to verify that new developments are reliable and reproducible. While accuracy of a trained model increases, reproducibility of a trained model can be elusive. Reproducibility of a trained model refers to whether two models that have identical machine learning architecture and that are trained using identical code and training data generate the same predictions. Low reproducibility of trained models can result in wasted time and resources in re-training models or attempting to replicate model setup or random seed parameters, particularly as this information is not often shared publically or easily available. Reproducibility may also be hindered by variation in software versions and hardware elements, if models are run on different equipment.

Reproducibility of machine learning models is a crucial property, enabling third parties to inspect or audit machine learning models and subsequent decision processes based on model outputs. Developers are often asked to provide documentation detailing experiment setup, training data and parameters to allow others to build on the findings and encourage transparency, however, due to the “black box” nature of artificial neural networks it may not be easy to include values of random seeds or other random values used in the training process. Re-training large machine learning models can be costly, requiring large computational resources. Bypassing the time-consuming and energy intensive process of re-training large models can expedite research, allowing developers to build on and improve previous models quickly and efficiently. More efficient re-training of large models also has environmental benefits, eliminating wasteful computation and helping to reduce associated carbon emissions from data centers.

Machine learning models such as artificial neural networks make use of random elements in training processes such as data shuffling, weight initialization, and batch ordering, among others. Incorporating random elements in model training is beneficial for generating robust and accurate models. One existing approach to controlling randomness in the training process includes manually pre-setting all of the random seeds for all the dependent software packages used in the training process. This approach has disadvantages such as: (1) As the randomness would impact the performance of machine learning models, it is not trivial to select a good set of random seeds. A poor set of random seeds may cause the training process to converge to a local minima resulting in lower performance. (2) To preset the random seeds, developers need to examine the documentation of relevant software packages and instrument the code base. As the software evolves, it can be very costly and time consuming to maintain the instrumented code.

Another existing approach to controlling randomness in the training process includes recording the random states in a first training process and applying the recorded values during the second training process. This approach could be applied to certain software packages, such as numpy. A disadvantage of this approach is that many widely used and essential software packages such as scikit-learn, pytorch, and tensorflow do not include such features.

Accordingly, it would be useful to provide a solution that can control randomness in software systems without relying on manual seed input or code instrumentation.

SUMMARY

In various examples, the present disclosure describes methods and devices in which a software module is engaged prior to initiating a machine learning model training instance, to enable the automatic detection, recording, retrieval and use of random values generated during machine learning model training. In particular, the random values that are generated by system level non-deterministic functions and used during an initial machine learning training are recorded and stored in a training profile for use in future training instances using an interception module. In the event that the machine learning model training needs to be reproduced, the interception module can access the training profile during a second machine learning model training instance, and automatically retrieve and apply the random values used in the first machine learning model training instance to the second machine learning model training instance. The interception module operates independently from any software packages responsible for introducing the random values into the machine learning training. As a result, the random states can be recorded and retrieved at the system level with no need to instrument the machine learning training code.

In various examples, the present disclosure provides the technical effect that a trained machine learning model is obtained, along with an output file disclosing the random values generated during the initial machine learning model training instance. The trained machine learning model can therefore be reproduced in subsequent machine learning model training instances using the same input data and training setup and by replacing the random values generated by system level functions with stored values.

A technical advantage of examples of the disclosed methods and systems is that the disclosed methods and devices help to ensure that best practices for conducting machine learning training are maintained whereby random values for machine learning training are not manually set and the machine learning training have freedom to explore optimal solutions.

In some example aspects, the present disclosure describes a method for validating reproducibility of a machine learning model. The method includes: obtaining a training profile containing random values used for a first machine learning model training instance to train the machine learning model; performing a second machine learning model training instance to train the machine learning model using the random values stored within the training profile by: initiating the second machine learning training model instance; intercepting non-determinism introducing functions called during the second machine learning model training instance; retrieving, from the training profile, corresponding random values generated by the non-determinism introducing functions during the first machine learning model training instance; and using the retrieved random values in the second machine learning model training instance; in response to validating that the results returned from the first and second machine learning model training instances are identical, storing the machine learning model as a validated reproducible machine learning model.

In the preceding example aspect of the method, the method may further include, wherein obtaining a training profile comprises generating the training profile by: initiating a first machine learning training model instance; intercepting the non-determinism introducing functions called during the first machine learning model training instance; obtaining random values generated from the non-determinism introducing functions called during the first machine learning model training instance; and storing the random values generated from the non-determinism introducing functions called during the first machine learning model training instance into the training profile.

In the preceding example aspect of the method, the method may further include: profiling, at a system level, non-determinism introducing functions used during the first machine learning model training instance, wherein the non-determinism introducing function profile is used to intercept the non-determinism introducing functions during the first or the second machine learning model training instance.

In the preceding example aspect of the method, wherein profiling the non-determinism introducing functions at the system level, the method may further include: extracting all system level function calls used during the first machine learning model training instance; identifying, from the extracted system level function calls, the non-determinism introducing functions via keyword-based heuristics; and storing a list of the identified non-determinism introducing functions in the non-determinism introducing function profile.

In any of the preceding example aspects of the method, wherein performing the second machine learning model training instance, the method may further include: searching for the training profile containing random values generated from the non-determinism introducing functions called during the first machine learning model training instance.

In any of the preceding example aspects of the method, wherein the non-determinism introducing functions are system level functions.

In any of the preceding example aspects of the method, the method may include: in response to determining that the results returned from the first and second machine learning model training instances are different: updating the training profile by: searching for additional non-determinism introducing functions using keyword-based heuristics; adding the additional non-determinism introducing functions to update a non-determinism introducing function profile; using the updated non-determinism introducing function profile to identify and intercept the additional non-determinism introducing functions during the first machine learning model training instance; and storing the random values generated from the additional non-determinism introducing functions in the training profile; repeating the second machine learning model training instance using the updated training profile.

In any of the preceding example aspects of the method, the method may further include: pointing a computer environment variable to a dynamic library to give precedence to the instructions contained within dynamic library over other libraries during the execution of the first and second machine learning model training instances; and intercepting the non-determinism introducing functions called during the first and second machine learning model training instances using the dynamic library.

In the preceding example aspect of the method, wherein the interception module leverages an API hook mechanism to load the dynamic library for intercepting the non-determinism introducing functions called during the first and second machine learning model training instances.

In any of the preceding example aspects of the method, the method may further include: storing, in a system log, information indicating that the training profile generated during a first machine learning model training instance was applied to a second machine learning model training instance.

In some example aspects, the present disclosure describes a device for validating reproducibility of a machine learning model. The device includes a processing unit configured to execute instructions to cause the device to: obtain a training profile containing random values used for a first machine learning model training instance to train the machine learning model; perform a second machine learning model training instance to train the machine learning model using the random values stored within the training profile by: initiating the second machine learning training model instance; intercepting non-determinism introducing functions called during the second machine learning model training instance; retrieving, from the training profile, corresponding random values generated by the non-determinism introducing functions during the first machine learning model training instance; and using the retrieved random values in the second machine learning model training instance; in response to validating that the results returned from the first and second machine learning model training instances are identical, store the machine learning model as a validated reproducible machine learning model.

In the preceding example aspect of the device, the processing unit may be further configured to generate a training profile by executing the instructions to cause the device to: initiate a first machine learning training model instance; intercept non-determinism introducing functions called during the first machine learning model training instance; obtain random values generated from the non-determinism introducing functions called during the first machine learning model training instance; and store the random values generated from the non-determinism introducing functions called during the first machine learning model training instance into the training profile.

In the preceding example aspect of the device, the processing unit may be further configured to execute the instructions to cause the device to: profile, at a system level, non-determinism introducing functions used during the first machine learning model training instance, wherein the non-determinism introducing function profile is used to intercept the non-determinism introducing functions during the first and the second machine learning model training instances.

In the preceding example aspect of the device, wherein in profiling the non-determinism introducing functions at the system level, the processing unit may be further configured to execute the instructions to cause the device to: extract all system level function calls; identify, from the extracted system level function calls, the non-determinism introducing functions via keyword-based heuristics; and store a list of the identified non-determinism introducing functions in a non-determinism introducing function profile.

In any of the preceding example aspects of the device, wherein in performing the second machine learning model training instance, the processing unit may be further configured to execute the instructions to cause the device to: search for an existing training profile containing random values generated from the non-determinism introducing functions called during the first machine learning model training instance.

In any of the preceding example aspects of the device, wherein in response to determining that the results returned from the first and second machine learning model training instances are not identical, the processing unit may be further configured to execute the instructions to cause the device to: update the training profile by: searching for additional non-determinism introducing functions using keyword-based heuristics; adding additional non-determinism introducing functions to update a non-determinism introducing function profile; using the updated non-determinism introducing function profile to identify and intercept the additional non-determinism introducing functions during the first machine learning model training instance; and storing the random values generated from the additional non-determinism introducing functions in the training profile; repeat the second machine learning model training instance using the updated training profile.

In any of the preceding example aspects of the device, the processing unit may be further configured to execute the instructions to cause the device to: point a computer environment variable to a dynamic library to give precedence to the instructions contained within the dynamic library over other libraries during the execution of the first and second machine learning model training instances; and intercept the non-determinism introducing functions called during the first and second machine learning model training instances using the dynamic library.

In the preceding example aspect of the device, wherein the interception module leverages an API hook mechanism to load the dynamic library for intercepting the non-determinism introducing functions called during the first or second machine learning model training instances.

In any of the preceding example aspects of the device, the processing unit may be further configured to execute the instructions to cause the device to: store, in a system log, information indicating that the training profile generated during the first machine learning model training instance was applied to the second machine learning model training instance.

In some example aspects, the present disclosure describes a computer readable medium storing instructions thereon. The instructions, when executed by a processing unit of a device, cause the device to: validate a machine learning model, the machine learning model validation comprising: obtaining a training profile containing random values used for a first machine learning model training instance to train the machine learning model; performing a second machine learning model training instance to train the machine learning model using the random values stored within the training profile by: initiating the second machine learning training model instance; intercepting non-determinism introducing functions called during the second machine learning model training instance; retrieving, from the training profile, corresponding random values generated by the non-determinism introducing functions during the first machine learning model training instance; and using the retrieved random values in the second machine learning model training instance; in response to validating that the results returned from the first and second machine learning model training instances are identical, storing the machine learning model as a validated reproducible machine learning model.

In another example aspect, the present disclosure describes a computer readable medium having instructions encoded thereon, wherein the instructions, when executed by a processing unit of a system, cause the system to perform any of the preceding example aspects of the method.

In another example aspect, the present disclosure describes a computer program including instructions which, when the program is executed by a computer, cause the computer to carry out any of the preceding example aspects of the method.

BRIEF DESCRIPTION OF THE DRAWINGS

Reference will now be made, by way of example, to the accompanying drawings which show example embodiments of the present application, and in which:

FIG. 1 is a block diagram of an example computing system which may be used to implement examples of the present disclosure;

FIG. 2 is a flowchart illustrating an example method for profiling the non-determinism introducing functions invoked during machine learning model training, in accordance with examples of the present disclosure;

FIG. 3 is a is a block diagram illustrating an example interception module architecture, in accordance with examples of the present disclosure;

FIG. 4 is a flowchart illustrating an example method for validating the reproducibility of a machine learning model, in accordance with examples of the present disclosure; and

FIG. 5 is a flowchart illustrating an example method for generating a training profile, in accordance with examples of the present disclosure.

Similar reference numerals may have been used in different figures to denote similar components.

DESCRIPTION OF EXAMPLE EMBODIMENTS

In various examples, the present disclosure describes methods and devices that help to address the problem of machine learning model reproducibility, by automatically capturing the random values used by machine learning algorithms during a first training instance and applying those same random values in a second training instance. More specifically, an interception module is used to intercept system level non-determinism introducing functions called during machine learning model training such that information on random values is obtained seamlessly without consideration to which software packages have introduced the randomness. Further, training of the reproducible machine learning model is accomplished without manually setting random seeds, or instrumenting machine learning model source code.

To assist in understanding the present disclosure, some terminology is first introduced. In computer programming, a nondeterministic algorithm is an algorithm that may produce different outcomes when run multiple times even with the same inputs. One reason that nondeterministic algorithms exhibit different behaviors is due to their probabilistic nature, in that they employ elements of randomness in their logic.

Machine learning algorithms such as artificial neural networks make use of random elements in training processes such as data shuffling, weight initialization, and batch ordering, among others. Incorporating random elements in model training is beneficial for generating robust and accurate models. Within the context of machine learning training algorithms, a nondeterminism-introducing function is a function called during the machine learning training process that generates random values, and causes the model performance to be nondeterministic.

Random numbers are typically generated on computers using a pseudorandom number generator. The pseudorandom number generator generates a sequence of numbers which appear random, but the process is a deterministic function, and will return the same sequence of random numbers if the same starting point is used. This starting point is referred to as a random value. In machine learning training algorithms, fixing the random value is a common way to control randomness and ensure models are reproducible. Reproducibility of a trained model refers to whether two models that have identical machine learning architecture and that are trained using identical code and training data generate the same predictions. One challenge associated with manually fixing random values in machine learning training algorithms, is that it can be difficult to choose an appropriate value. Poorly set values may negatively impact model performance, potentially leading models to converge at local minima instead of an optimal solution. Another drawback to presetting random values is that it requires developers to instrument the code, and maintain this instrumented code base over time.

The present disclosure describes examples that may help to address some or all of the above drawbacks of existing technologies.

In the present disclosure, reproducibility of a machine learning model means that a machine learning model is trained (e.g., starting from a random initiation) in a first training, to obtain a first trained model that generates certain results (e.g., certain prediction outputs) . Then, after re-initiating the machine learning model, a second training of the same machine learning model (i.e., having the same machine learning architecture) is performed to obtain a second trained model that generates the same (or substantially the same) results as the first trained model.

For simplicity, the following discussion will describe some examples in the context of training machine learning systems. However, it should be understood that the present disclosure is also applicable to other software applications which are impacted by randomness, and where randomness is a factor impacting reproducibility, among other possibilities.

FIG. 1 is a block diagram illustrating a simplified example implementation of a computing system 100 that is suitable for implementing embodiments described herein. Examples of the present disclosure may be implemented in other computing systems, which may include components different from those discussed below. Although FIG. 1 shows a single instance of each component, there may be multiple instances of each component in the computing system 100. The computing system 100 may be used to execute instructions for training a machine learning model, using any of the examples described above. The computing system 100 may also to execute the trained machine learning model, or the trained machine learning model may be executed by another computing system.

Although FIG. 1 shows a single instance of each component, there may be multiple instances of each component in the computing system 100. Further, although the computing system 100 is illustrated as a single block, the computing system 100 may be a single physical machine or device (e.g., implemented as a single computing device, such as a single workstation, single consumer device, single server, etc. ) , or may comprise a plurality of physical machines or devices (e.g., implemented as a server cluster) . For example, the computing system 100 may represent a group of servers or cloud computing platform providing a virtualized pool of computing resources (e.g., a virtual machine, a virtual server) .

The computing system 100 includes at least one processing unit 102, such as a processor, a microprocessor, a digital signal processor, an application-specific integrated circuit (ASIC) , a field-programmable gate array (FPGA) , a dedicated logic circuitry, a dedicated artificial intelligence processor unit, a graphics processing unit (GPU) , a tensor processing unit (TPU) , a neural processing unit (NPU) , a hardware accelerator, or combinations thereof.

The computing system 100 may include an optional input/output (I/O) interface 104, which may enable interfacing with an optional input device 106 and/or optional output device 108. In the example shown, the optional input device 106 (e.g., a keyboard, a mouse, a microphone, a touchscreen, and/or a keypad) and optional output device 108 (e.g., a display, a speaker and/or a printer) are shown as optional and external to the computing system 100. In other example embodiments, there may not be any input device 106 and output device 108, in which case the I/O interface 104 may not be needed.

The computing system 100 may include an optional network interface 110 for wired or wireless communication with other computing systems (e.g., other computing systems in a network) . The network interface 110 may include wired links (e.g., Ethernet cable) and/or wireless links (e.g., one or more antennas) for intra-network and/or inter-network communications. For example, the network interface 110 may enable the computing system 100 to access training data samples from an external database, or a cloud-based data center (among other possibilities) where training datasets are stored.

The computing system 100 may include a memory 112, which may include a volatile or non-volatile memory (e.g., a flash memory, a random access memory (RAM) , and/or a read-only memory (ROM) ) . The non-transitory memory 112 may store instructions for execution by the processing unit 102, such as to carry out examples described in the present disclosure. For example, the memory 112 may store instructions for implementing any of the networks and methods disclosed herein. The memory 112 may include other software instructions, such as for implementing an operating system and other applications/functions. The memory 112 may also include data 114, such as trained parameters (e.g., weight values) of a neural network.

In some examples, the computing system 100 may also include an electronic storage unit (not shown) , such as a solid state drive, a hard disk drive, a magnetic disk drive and/or an optical disk drive. In some examples, data and/or instructions may be provided by an external memory (e.g., an external drive in wired or wireless communication with the computing system 100) or may be provided by a transitory or non-transitory computer-readable medium. Examples of non-transitory computer readable media include a RAM, a ROM, an erasable programmable ROM (EPROM) , an electrically erasable programmable ROM (EEPROM) , a flash memory, a CD-ROM, or other portable memory storage. The components of the computing system 100 may communicate with each other via a bus, for example.

FIG. 2 is a flowchart illustrating an example method 200 for profiling system-level non-determinism introducing functions used during the first machine learning model training instance. In accordance with examples of the present disclosure. The method 200 may be performed by the computing system 100. For example, the processing unit 102 may execute computer readable instructions (which may be stored in the memory 112) to cause the computing system 100 to perform the method 200. The method 200 may be performed using a single physical machine (e.g., a workstation or server) , a plurality of physical machines working together (e.g., a server cluster) , or cloud-based resources (e.g., using virtual resources on a cloud computing platform) .

At 202, instructions are executed to extract a list of system level function calls used during a first machine learning model training instance. As an example, open source tools (e.g., strace in Linux, Process Monitor in Windows, Sysdig across all platforms, etc. ) may be used to extract a list of system function calls using a technique known as dynamic profiling.

At 204, from the list of extracted system level function calls, a list of non-determinism introducing functions are identified via keyword based heuristics. As an example, functions such as getrandom or openat ( “/dev/urandom” ) may be identified through a heuristic keyword search using the word “random” .

At 206, a list of non-determinism introducing functions is assembled and stored in a non-determinism introducing function profile. The non-determinism introducing function profile is used to intercept non-determinism introducing functions during the first or second machine learning model training instance, in that the non-determinism introducing functions contained within the function profile inform the interception module 300 of which system level functions should be intercepted during the first or second machine learning model training instance.

In some examples, other methods for obtaining a system level non-determinism introducing function profile may be used (e.g., may be manually derived from technical documentation) .

In some examples, after a system level non-determinism introducing function profile is generated, the profile does not need to be re-generated unless the validation step identifies that the reading, storing and re-application of random values are not producing identical results. In this case there may be an additional non-determinism introducing function that is not captured in the function profile and another heuristic keyword search may be performed to identify missing functions.

FIG. 3 is a block diagram illustrating an example interception module architecture 300, that may be used to intercept system level non-determinism introducing functions called during machine learning model training in accordance with examples of the present disclosure.

In some examples, the interception module 300 leverages an application programming interface (API) hook mechanism to load a dynamic library 310 for intercepting system level non-determinism introducing functions called during machine learning model training instances. Prior to initiating a machine learning model training instance 350, software code may be first initialized to activate the interception module 300 and load associated instructions into the dynamic library 310. Within the dynamic library 310 are contained instructions to intercept system level functions called during machine learning model training instances. Contained within the dynamic library 310 is also instructions to map a list of non-determinism introducing functions contained within the non-determinism introducing function profile (e.g. identified using method 200) , to interception functions 320, such that the interception module 300 is directed toward which system level functions are non-determinism introducing functions that need to be intercepted during the machine learning model training instance 350.

Also prior to initiating a machine learning model training instance 350, an environment variable (e.g. LD_PRELOAD in Linux/Solaris/FreeBSD environments) can be pointed to the path of dynamic library 310 to pre-load that library into memory 112. In this way, upon the initiation of a machine learning training instance 350, the dynamic library 310 will be loaded into memory before any other shared libraries, and the instructions contained within dynamic library 310 will be given precedence over other libraries during the execution of any machine learning model training instance.

In the context of intercepting system level functions, an API hook mechanism is used to intercept API calls between two processes and invoke the customized functions in between (e.g., modifying the behavior of API calls, or recording the return values from original API calls) . Therefore, when a machine learning model training instance 350 is initiated, the interception module 300 will invoke the system library 330 (and the associated system level functions 340) on behalf of the model training instance 350 to retrieve returned random values.

During the machine learning model training instance 350, the interception module 300 executes instructions to read random values generated during machine learning model training and stores these random values into an intermediate output file, referred to herein as a training profile 360. In a subsequent second machine learning training instance, the interception module 300 retrieves the stored random values from the training profile 360, and (e.g., by leveraging the API hook mechanism) replaces the associated random values with stored values when interception functions 320 are called. In instances where a first and second machine learning training use the same input training data and training setup, execution of the interception module 300 during training helps to ensure that the same random values will be applied in both training instances, which may help to ensure that the results of both training instances will be the same (or substantially the same) .

FIG. 4 is a flowchart illustrating an example method 400 for validating a machine learning model, in accordance with examples of the present disclosure. The method 400 may be performed by the computing system 100. For example, the processing unit 102 may execute computer readable instructions (which may be stored in the memory 112) to cause the computing system 100 to perform the method 400. The method 400 may be performed using a single physical machine (e.g., a workstation or server) , a plurality of physical machines working together (e.g., a server cluster) , or cloud-based resources (e.g., using virtual resources on a cloud computing platform) .

At 402, a first machine learning model training instance is performed to train the machine learning model and obtain a training profile 360. The training profile 360 is an intermediate output file generated by the interception module 300 and contains the random values used for a first machine learning model training instance. Further details about step 402 are provided in the discussion of method 500 depicted in FIG. 5.

FIG. 5 is a flowchart illustrating an example method 500 for generating a training profile, in accordance with examples of the present disclosure. The method 500 may be performed by the computing system 100. For example, the processing unit 102 may execute computer readable instructions (which may be stored in the memory 112) to cause the computing system 100 to perform the method 400. The method 400 may be performed using a single physical machine (e.g., a workstation or server) , a plurality of physical machines working together (e.g., a server cluster) , or cloud-based resources (e.g., using virtual resources on a cloud computing platform) .

In some example aspects of the method, once a non-determinism introducing function profile is obtained and prior to the initiation of the first machine learning model training instance in step 504, software code is first initialized to activate the interception module 300 and load associated instructions into the dynamic library 310.

At 502, an environment variable (e.g. LD_PRELOAD in Linux/Solaris/FreeBSD environments) is pointed to the path of dynamic library 310 to pre-load that library into memory 112. In this way, upon the initiation of a first machine learning training instance in step 504, the dynamic library 310 will be loaded before any other shared libraries and the instructions contained within dynamic library 310 will be given precedence over other libraries during the execution of the second machine learning model training instance.

At 504, a first machine learning model training instance is initiated. Common software packages used in the machine learning model training process may include scikit-learn, numpy, tensorflow, among others.

In some instances, machine learning training may be performed on a GPU to take advantage of parallel operations and accelerate the training process. In this situation, an additional step may require hardware related libraries (e.g., CUDA and cuDNN) to perform the parallel operations to accurately address non-determinism introduced by parallel processing. In machine learning training, many processes involve large numbers of floating point execution. Due to the presence of floating point rounding error, the sequences in which processes are executed in parallel will impact the final results. For example, for a computation A+B+C, execution of sequence (A+B) +C will provide a different result than A+ (B+C) , since it is unknown whether the parallel processer will execute (A+B) or (B+C) first. To address this source of non-determinism in machine learning training, common machine learning software frameworks such as tensorflow and pytorch offer a method to mitigate this issue.

At 506, the non-determinism introducing functions called during the first machine learning model training instance are intercepted using the dynamic library 310. More specifically, the interception module 300 leverages an API hook mechanism to load the dynamic library 310 for intercepting the non-determinism introducing functions called during the first machine learning model training instances, invoking the system library 330 (and the associated system level functions 340) on behalf of the model training instance 350 to retrieve returned random values.

At 508, the interception module 300 obtains the random values generated from the non-determinism introducing functions called during the first machine learning model training instance.

At 510, the interception module 300 stores the random values that are generated from the non-determinism introducing functions called during the first machine learning model training instance into the training profile 360 so that the training profile 360 is generated.

Returning to FIG. 4, step 404, which is achieved by steps 406 to 412, performing a second machine learning model training instance to train the machine learning model using the random values stored within the training profile. In details, at 406, a second machine learning model training instance is initiated. Common software packages used in the machine learning model training process may include scikit-learn, numpy, tensorflow, among others. In some examples, the second machine learning model training instance may use the same input training data and training setup as the first machine learning model training instance.

At 408, non-determinism introducing functions called during the second machine learning model training instance are intercepted using the dynamic library 310. More specifically, the interception module 300 leverages an API hook mechanism to load the dynamic library 310 for intercepting the non-determinism introducing functions called during the second machine learning model training instances, invoking the system library 330 (and the associated system level functions 340) on behalf of the model training instance 350.

At 410, the corresponding random values generated by the non-determinism introducing functions during the first machine learning model training instance are retrieved from the training profile 360. In some examples, the interception module 300 may search for an existing populated training profile 360 containing random values generated from the non-determinism introducing functions called during the first machine learning model training instance. If in searching for an existing populated training profile 360, no populated training profile 360 is found, the interception module 300 may recognize that the current machine learning model training instance is a first machine learning model training instance (i.e., is not an attempt to reproduce a previous machine learning model training) and may end the method 400 and return to step 508 to obtain random values generated from non-determinism introducing functions called during the training instance.

At 412, the retrieved random values from the training profile 360 are used in the second machine learning model training instance. For each non-determinism introducing function that is called during the second machine learning model training instance, the interception module 300 applies the associated random values from the training profile 360 mapped through each interception function 320

In some examples, a log mechanism is used to store information in the system log, which can be checked to ensure that the second machine learning model training instance was performed using the random values saved during the first machine learning model training instance. After step 412, system logs can be examined to ensure that the random values stored in the training profile 360 were inserted into the second machine learning model training instance.

At 414 the results returned from the first and second machine learning model training instances are compared to ensure that the results are identical. If the results are found to be identical, the model is deemed to be reproducible and the machine learning model is stored as a validated reproducible machine learning model in step 416. If the results are found not to be identical, additional system level non-determinism introducing functions may need to be searched for as shown in method 200 and added to the non-determinism introducing function profile.

According to further example embodiments, an alternative method of ensuring the reproducibility of machine learning model training is to extract all the package level functions that have been invoked by the software libraries used in the machine learning training. Then, using keyword-based heuristics, all non-determinism introducing functions may be identified. These functions could then be instrumented with logging statements to record the random values used by the non-determinism introducing functions. The functions would be also instrumented with code statements that check if random values have already been recorded. If the random values have already been recorded and written into an intermediate file, when the machine learning training is rerun, the random values could be read from the intermediate files and used in the functions instead of using a random value.

According to further example embodiments, another alternative approach to ensuring reproducibility of machine learning model training is to use a random function (seeded with a meta random seed) to randomly select a set of seeds for all the software packages that can be configured with a random seed. Then only the meta random seed needs to be recorded and used in the replay phase to generate the exact same set of random seeds.

According to further example embodiments, the disclosed methods and devices could be used in applications other than machine learning systems. As long as a software system is impacted by randomness and reproducibility is one of the requirements of the behavior of the software systems, the disclosed methods and devices could be used to record the random states and replace those random states when needed.

Although the present disclosure describes methods and processes with steps in a certain order, one or more steps of the methods and processes may be omitted or altered as appropriate. One or more steps may take place in an order other than that in which they are described, as appropriate.

Although the present disclosure is described, at least in part, in terms of methods, a person of ordinary skill in the art will understand that the present disclosure is also directed to the various components for performing at least some of the aspects and features of the described methods, be it by way of hardware components, software or any combination of the two. Accordingly, the technical solution of the present disclosure may be embodied in the form of a software product. A suitable software product may be stored in a pre-recorded storage device or other similar non-volatile or non-transitory computer readable medium, including DVDs, CD-ROMs, USB flash disk, a removable hard disk, or other storage media, for example. The software product includes instructions tangibly stored thereon that enable a computing system to execute examples of the methods disclosed herein. The machine-executable instructions may be in the form of code sequences, configuration information, or other data, which, when executed, cause a machine (e.g., a processor or other processing unit) to perform steps in a method according to examples of the present disclosure.

The present disclosure may be embodied in other specific forms without departing from the subject matter of the claims. The described example embodiments are to be considered in all respects as being only illustrative and not restrictive. Selected features from one or more of the above-described embodiments may be combined to create alternative embodiments not explicitly described, features suitable for such combinations being understood within the scope of this disclosure.

All values and sub-ranges within disclosed ranges are also disclosed. Also, although the systems, devices and processes disclosed and shown herein may comprise a specific number of elements/components, the systems, devices and assemblies could be modified to include additional or fewer of such elements/components. For example, although any of the elements/components disclosed may be referenced as being singular, the embodiments disclosed herein could be modified to include a plurality of such elements/components. The subject matter described herein intends to cover and embrace all suitable changes in technology.

Claims

A method for validating reproducibility of a machine learning model, the method comprising:

obtaining a training profile containing random values used for a first machine learning model training instance to train the machine learning model;

performing a second machine learning model training instance to train the machine learning model using the random values stored within the training profile by:

initiating the second machine learning training model instance;

intercepting non-determinism introducing functions called during the second machine learning model training instance;

retrieving, from the training profile, corresponding random values generated by the non-determinism introducing functions during the first machine learning model training instance; and

using the retrieved random values in the second machine learning model training instance;

in response to validating that the results returned from the first and second machine learning model training instances are identical, storing the machine learning model as a validated reproducible machine learning model.
The method of claim 1 wherein obtaining the training profile comprises generating the training profile by:

initiating the first machine learning training model instance;

intercepting the non-determinism introducing functions called during the first machine learning model training instance;

obtaining random values generated from the non-determinism introducing functions called during the first machine learning model training instance; and

storing the random values generated from the non-determinism introducing functions called during the first machine learning model training instance into the training profile.
The method of claim 2, further comprising:

profiling, at a system level, non-determinism introducing functions used during the first machine learning model training instance, wherein a non-determinism introducing function profile is used to intercept the non-determinism introducing functions during the first or the second machine learning model training instance.
The method of claim 3, wherein profiling the non-determinism introducing functions at the system level comprises:

extracting all system level function calls used during the first machine learning model training instance;

identifying, from the extracted system level function calls, the non-determinism introducing functions via keyword-based heuristics; and

storing a list of the identified non-determinism introducing functions in the non-determinism introducing function profile.
The method of any one of claims 1-4, wherein performing the second machine learning model training instance further comprises:

searching for the training profile containing random values generated from the non-determinism introducing functions called during the first machine learning model training instance.
The method of any one of claims 1 to 5, wherein the non-determinism introducing functions are system level functions.
The method of any one of claims 3 to 6, further comprising:

in response to determining that the results returned from the first and second machine learning model training instances are different:

updating the training profile by:

searching for additional non-determinism introducing functions using keyword-based heuristics;

adding the additional non-determinism introducing functions to update the non-determinism introducing function profile;

using the updated non-determinism introducing function profile to identify and intercept the additional non-determinism introducing functions during the first machine learning model training instance; and

storing the random values generated from the additional non-determinism introducing functions in the training profile;

repeating the second machine learning model training instance using the updated training profile.
The method of any one of claims 1 to 7, further comprising:

pointing a computer environment variable to a dynamic library to give precedence to the instructions contained within dynamic library over other libraries during the execution of the first and second machine learning model training instances; and

intercepting the non-determinism introducing functions called during the first and second machine learning model training instances using the dynamic library.
The method of claim 8, wherein the interception module leverages an API hook mechanism to load the dynamic library for intercepting the non-determinism introducing functions called during the first or second machine learning model training instances.
The method of any one of claims 1 to 9, further comprising:

storing, in a system log, information indicating that the training profile generated during a first machine learning model training instance was applied to a second machine learning model training instance.
A device for validating reproducibility of a machine learning model, the device comprising a processing unit configured to execute instructions to cause the device to:

obtain a training profile containing random values used for a first machine learning model training instance to train the machine learning model;

perform a second machine learning model training instance to train the machine learning model using the random values stored within the training profile by:

initiating the second machine learning training model instance;

intercepting non-determinism introducing functions called during the second machine learning model training instance;

retrieving, from the training profile, corresponding random values generated by the non-determinism introducing functions during the first machine learning model training instance; and

using the retrieved random values in the second machine learning model training instance;

in response to validating that the results returned from the first and second machine learning model training instances are identical, store the machine learning model as a validated reproducible machine learning model.
The device of claim 11, wherein in obtaining the training profile, the processing unit is further configured to generate the training profile by executing the instructions to cause the device to:

initiate the first machine learning training model instance;

intercept non-determinism introducing functions called during the first machine learning model training instance;

obtain random values generated from the non-determinism introducing functions called during the first machine learning model training instance; and

store the random values generated from the non-determinism introducing functions called during the first machine learning model training instance into the training profile.
The device of claim 12, the device comprising a processing unit further configured to execute instructions to cause the device to:

profile, at a system level, non-determinism introducing functions used during the first machine learning model training instance, wherein a non-determinism introducing function profile is used to intercept the non-determinism introducing functions during the first and the second machine learning model training instances.
The device of claim 13, wherein in profiling the non-determinism introducing functions at the system level, the device comprising a processing unit further configured to execute instructions to cause the device to:

extract all system level function calls;

identify, from the extracted system level function calls, the non-determinism introducing functions via keyword-based heuristics; and

store a list of the identified non-determinism introducing functions in the non-determinism introducing function profile.
The device of any one of claims 11-14, wherein in performing a second machine learning model training instance, the processing unit is further configured to execute instructions to cause the device to:

search for the training profile containing random values generated from the non-determinism introducing functions called during the first machine learning model training instance.
The device of any one of claims 13 to 15, wherein in response to determining that the results returned from the first and second machine learning model training instances are different, the device comprising a processing unit further configured to execute instructions to cause the device to:

update the training profile by:

searching for additional non-determinism introducing functions using keyword-based heuristics;

adding additional non-determinism introducing functions to update the non-determinism introducing function profile;

using the updated non-determinism introducing function profile to identify and intercept the additional non-determinism introducing functions during the first machine learning model training instance; and

storing the random values generated from the additional non-determinism introducing functions in the training profile;

repeat the second machine learning model training instance using the updated training profile.
The device of any one of claims 11 to 16, the device comprising a processing unit further configured to execute instructions to cause the device to:

point a computer environment variable to a dynamic library to give precedence to the instructions contained within the dynamic library over other libraries during the execution of the first and second machine learning model training instances; and

intercept the non-determinism introducing functions called during the first and second machine learning model training instances using the dynamic library.
The device of claim 17, wherein the interception module leverages an API hook mechanism to load the dynamic library for intercepting the non-determinism introducing functions called during the first or second machine learning model training instances.
The device of any one of claims 11 to 18, the device comprising a processing unit further configured to execute instructions to cause the device to:

store, in a system log, information indicating that the training profile generated during the first machine learning model training instance was applied to the second machine learning model training instance.
A non-transitory computer readable medium storing instructions thereon, wherein the instructions, when executed by a processing unit of a device, cause the device to:

validate reproducibility of a machine learning model, comprising:

obtaining a training profile containing random values used for a first machine learning model training instance to train the machine learning model;

performing a second machine learning model training instance to train the machine learning model using the random values stored within the training profile by:

initiating the second machine learning training model instance;

intercepting non-determinism introducing functions called during the second machine learning model training instance;

retrieving, from the training profile, corresponding random values generated by the non-determinism introducing functions during the first machine learning model training instance; and

using the retrieved random values in the second machine learning model training instance;

in response to validating that the results returned from the first and second machine learning model training instances are identical, storing the machine learning model as a validated reproducible machine learning model.
The non-transitory computer readable medium of claim 20, wherein the instructions further cause the device to, for obtaining the training profile:

generate the training profile by:

initiating the first machine learning training model instance;

intercepting the non-determinism introducing functions called during the first machine learning model training instance;

obtaining random values generated from the non-determinism introducing functions called during the first machine learning model training instance; and

storing the random values generated from the non-determinism introducing functions called during the first machine learning model training instance into the training profile.
The non-transitory computer readable medium of claim 21, wherein the instructions further cause the device to:

profile, at a system level, non-determinism introducing functions used during the first machine learning model training instance, wherein a non-determinism introducing function profile is used to intercept the non-determinism introducing functions during the first or second machine learning model training instances.
The non-transitory computer readable medium of claim 22, wherein the instructions further cause the device to:

extract all system level function calls used during the first machine learning model training instance;

identify, from the extracted system level function calls, the non-determinism introducing functions via keyword-based heuristics; and

store a list of the identified non-determinism introducing functions in the non-determinism introducing function profile.
The non-transitory computer readable medium of any one of claims 20-23, wherein the instructions further cause the device to, when performing the second machine learning model training instance:

search for the training profile containing random values generated from the non-determinism introducing functions called during the first machine learning model training instance.
The non-transitory computer readable medium of any one of claims 22-24, wherein the instructions further cause the device to, in response to determining that the results returned from the first and second machine learning model training instances are different:

update the training profile by:

searching for additional non-determinism introducing functions using keyword-based heuristics; and

adding the additional non-determinism introducing functions to update the non-determinism introducing function profile; and

using the updated non-determinism introducing function profile to identify and intercept the additional non-determinism introducing functions during the first machine learning model training instance; and

storing the random values generated from the additional non-determinism introducing functions in the training profile; and

repeat the second machine learning model training instance using the updated training profile.
The non-transitory computer readable medium of any one of claims 20-25, wherein the instructions further cause the device to:

point a computer environment variable to a dynamic library to give precedence to the instructions contained within the dynamic library over other libraries during the execution of the first and second machine learning model training instances; and

intercept the non-determinism introducing functions called during the first and second machine learning model training instances using the dynamic library.
The non-transitory computer readable medium of any one of claims 20-26, wherein the instructions further cause the device to:

store, in a system log, information indicating that the training profile generated during a first machine learning model training instance was applied to a second machine learning model training instance.
A computer readable medium having instructions encoded thereon, wherein the instructions, when executed by a processing unit of a system, cause the system to perform the method of any one of claims 1 to 10.
A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 1 to 10.