AutoML Using Nvidia Clara and Organizing Jobs Using MLFlow

Anil Kemisetti
Published in DLIterations · Mar 2, 2021


In this blog, I explain Nvidia Clara Train's AutoML module and its integration with MLFlow, an ML lifecycle management tool. I provide the context for this integration along with code snippets.

Nvidia Clara Train is a framework that provides an end-to-end workflow for deep learning training in medical imaging. From an engineering perspective, there are fundamentally two loops in training a deep learning model: an outer loop over the hyperparameters and an inner loop performing the weight training. The Nvidia Clara framework supports both of these loops, and Clara AutoML is the module that handles the outer loop. Clara separates the AutoML search into two parts: it models the areas to be searched as search spaces, and each search space offers the capability to search over its parameters, which it calls parameter search.

While performing this search for the best hyperparameters, training is run multiple times, and each run is called a job. A researcher needs to keep track of myriad things across the two loops mentioned above. MLFlow provides a platform to track all the jobs: it helps track the input data, the hyperparameters for each run, the metrics, and other related files.

What is the problem? Why AutoML?

Successful model training involves optimizing the model parameters, or weights. This training depends on choices made during data preparation, feature engineering, model selection, the parameters affecting the training loop, and model evaluation. These choices are called hyperparameters. Hyperparameter optimization is a challenging problem for the large models used in deep learning. The following four equations describe this search problem.

Mathematical formulation of the AutoML problem
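The figure's four equations follow the hyperparameter optimization formulation of Bergstra and Bengio [1]; a reconstruction in LaTeX, under that assumption, is:

```latex
\begin{align*}
\lambda^{(*)} &= \operatorname*{arg\,min}_{\lambda \in \Lambda}\;
  \mathbb{E}_{x \sim \mathcal{G}_x}\!\left[\mathcal{L}\big(x;\,
  \mathcal{A}_\lambda(X^{(\mathrm{train})})\big)\right] \tag{1} \\
&\approx \operatorname*{arg\,min}_{\lambda \in \Lambda}\;
  \frac{1}{|X^{(\mathrm{valid})}|} \sum_{x \in X^{(\mathrm{valid})}}
  \mathcal{L}\big(x;\, \mathcal{A}_\lambda(X^{(\mathrm{train})})\big) \tag{2} \\
&\equiv \operatorname*{arg\,min}_{\lambda \in \Lambda}\; \Psi(\lambda) \tag{3} \\
&\approx \operatorname*{arg\,min}_{\lambda \in \{\lambda^{(1)}, \ldots, \lambda^{(S)}\}}
  \Psi(\lambda) \equiv \hat{\lambda} \tag{4}
\end{align*}
```

Here 𝓛 is the loss on a data point x drawn from the distribution 𝒢ₓ, 𝓐_λ(X⁽ᵗʳᵃⁱⁿ⁾) is the model produced by training algorithm 𝓐 with hyperparameters λ on the training set, Ψ is the response function, and {λ⁽¹⁾, …, λ⁽ˢ⁾} is the finite set of trial points actually evaluated.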

Let 𝓐 be the learning algorithm we are applying, and let us further assume that training is trying to learn a set of weights θ. Equation 1 formulates the hyperparameter problem: λ⁽*⁾ is the set of hyperparameters we are trying to find. Here we can observe the two loops involved in deep learning training. There is an inner loop in which 𝓐 trains θ, and the outer loop tries to choose the best hyperparameters λ [1].

The outer loop is the most challenging because this optimization has no analytical formulation. Equation 4 shows that, without an analytical form of the response function Ψ or full knowledge of the hyperparameter search space Λ, we try to find the best λ by evaluating a finite set of trial points using a limited amount of training data.

As model complexity increases, the search space for these parameters grows. Brute-force grid search can become computationally expensive, and manual selection using domain-knowledge heuristics often does not lead to an optimal choice of parameters. The challenge stems from the sheer size of the search space and the computational resources and expertise needed to perform such a search.

Promise of AutoML

AutoML is the process of automating various aspects of model training. AutoML brings the promise of making model training techniques accessible to non-experts. One of the essential tasks AutoML tries to address is hyperparameter search. The Nvidia Clara framework seamlessly integrates both deep learning training loops: the outer loop performs hyperparameter optimization through the AutoML module, and the inner loop trains a model given a set of hyperparameters.

The figure below lists the advantages of AutoML. In this blog, the focus is on the Clara AutoML module.

Advantages of AutoML

Clara is a framework specializing in the needs of healthcare and life sciences AI development. Nvidia Clara Train focuses on deep learning for medical imaging. The Clara Train SDK introduced the AutoML module starting in version 3.0. The main advantage of using the Nvidia Clara framework is that it seamlessly integrates both of the deep learning training loops mentioned above.

Nvidia Clara Train Design Philosophy

The pipeline for deep learning training provided by the Nvidia Clara Train framework is based on the "Inversion of Control" (IOC) design pattern, using "dependency injection" (DI) and "event-driven programming" (EDP).

Inversion of Control

As the name says, IOC inverts the control of code execution for the loops of deep learning training. IOC is about the separation of concerns.

Clara provides the skeleton, developed using its engineering expertise, and gives control of the pipeline back to the user through two patterns: "dependency injection" and "event-driven programming".

The framework delegates the creation of Python objects to the researcher; using these researcher-developed objects in the code flow is called dependency injection. Clara provides scalable, robust, and well-tested skeleton code. It takes away the burden of writing the housekeeping code needed to establish the hyperparameter and training loops, which allows the researcher to focus on algorithm and model development.

Further, the pipeline's main code learns about the researcher-developed dependencies through a configuration file called "config_train.json". Here lies another significant advantage of using Clara Train: in general, these dependencies are the pipeline's critical components, such as data transformers, models, loss functions, and metrics.

Nvidia provides a rich library of these components out of the box, which can be configured directly without writing any code.

Nvidia calls these user-developed dependencies "Bring Your Own Components" (BYOC). For researchers, the flexibility brought by BYOC is critical for using Clara for their deep learning training needs. Researchers can mix and match the Nvidia-developed components with their own.
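As a hedged illustration of BYOC and DI, the sketch below shows what a researcher-developed component and its configuration reference might look like; the class, module path, and config entry are illustrative assumptions, not the exact Clara Train API.

```python
# my_components.py -- sketch of a hypothetical user-developed (BYOC)
# component; names and config schema below are illustrative assumptions.
import numpy as np


class MySoftDiceLoss:
    """A researcher-developed loss that the framework instantiates via DI."""

    def __init__(self, smooth: float = 1e-5):
        self.smooth = smooth

    def __call__(self, pred: np.ndarray, target: np.ndarray) -> float:
        # Soft Dice loss over flattened prediction and target arrays.
        intersection = float((pred * target).sum())
        denominator = float(pred.sum() + target.sum())
        return 1.0 - (2.0 * intersection + self.smooth) / (denominator + self.smooth)


# In the training configuration, the component would be referenced by its
# module path so the skeleton code can construct it, for example:
#
#   "loss": {"path": "my_components.MySoftDiceLoss", "args": {"smooth": 1e-5}}
```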

Another advantage is that the modularity of these components makes them easy to share with other researchers.

Apart from dependency injection, Clara provides access to run-time information during pipeline execution using EDP. When events occur, the framework invokes hook methods on the handler objects registered in the configuration file. Another object that helps pass information from one pipeline step to the next is the context object. Context objects are accessible both to the dependency-injected objects and to the event handlers, providing access to information recorded in prior steps and a way to pass information on to subsequent steps in the data processing pipeline.

The three main concepts of "dependency injection," handlers based on "event-driven programming," and runtime information sharing using context objects provide the tremendous flexibility researchers need to use this framework successfully.

Inversion of Control Concerns

The very advantage of the IOC pattern, freeing the researcher from establishing the loops and the associated housekeeping, can become a disadvantage. With IOC, the control is inverted and abstracted away from the user. Configuration files replace the intuitive, straightforward logic of the loops, and because these files are not especially intuitive, they add to the learning curve. Clara is not open source, so it can be hard to understand and debug errors. It may not be possible to implement user-specific optimizations in the main loop.

Nvidia tries to provide a great deal of flexibility by using DI and EDP.

It also implements the most common components, following the design principle of "convention over configuration", allowing users to perform deep learning training with zero code for some scenarios.

This approach makes configuration choices for users based on industry best practices. There are limitations to Clara's flexibility, and care should be exercised to use the framework only for the supported use cases.

For these supported cases, a lot of engineering advantages like "Automatic Mixed Precision", "Data Parallelism", "Determinism", "Smart Cache", "AutoML", and "Federated Learning" are available without much coding.

Medical Model ARchive (MMAR)

Configuration files are fundamental for implementing the IOC, DI, and EDP design patterns and the principle of "convention over configuration". Clara uses JSON for these configuration files. Along with the configuration, a framework needs a project structure to organize all the artifacts. Clara Train calls this the Medical Model Archive. It is a self-contained directory structure that holds all the config files, shell scripts, documentation, other housekeeping artifacts, and the generated models. Following are the advantages of this strategy.

  1. It is easy to establish a project root directory and use relative paths in all the scripts and config files. This relative root path is a common strategy employed by major frameworks to achieve portability of project artifacts. It is easy to make a zip file of the directory structure and move it to a different location without modifying the scripts or config files.
  2. It is easy to version all the artifacts together. This versioning capability is important for researchers creating an experiment to track changes. Also, a unified project structure makes it easy to share the code as well as the results.
  3. Nvidia also uses the MMAR structure to share various Clara Train projects through its registry called NGC⁷.
  4. A standard directory structure also establishes a standard way of managing the life-cycle of a feature, which reduces the learning curve. A researcher familiar with Clara Train can intuitively learn how to use AutoML, Federated Learning, etc.
  5. The project structure also makes it easy to establish governance and manage resources.

In the figure below, the highlighted parts are the AutoML artifacts. "automl.sh" is the main script; it launches all the jobs. "automl_train_round.sh" is the script used to launch an individual job. The configuration is stored in "config_automl.json". The output is stored in a directory called automl. A sketch of this layout follows the figure.

MMAR Structure for AutoML
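A hedged sketch of the relevant parts of the MMAR layout; the subdirectory names are assumptions based on typical MMAR structure, and only the scripts and files named above are taken from the text:

```
mmar_root/
├── commands/
│   ├── automl.sh              # main script: launches all AutoML jobs
│   └── automl_train_round.sh  # launches an individual job
├── config/
│   ├── config_automl.json     # AutoML configuration
│   └── config_train.json      # training configuration and search spaces
└── automl/                    # output of all AutoML job runs
```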

Types of Hyperparameters

A framework supporting AutoML needs to provide a way to create search spaces for two types of hyperparameters: individual hyperparameters and conditional hyperparameters.

Individual Hyperparameters

Following are the four types of individual hyperparameters.

  1. Continuous value type: floating-point values, for example, the learning rate.
  2. Discrete value type: integers that take a specific range of discrete values, for example, the number of splits in k-fold cross-validation.
  3. Enumerate type: a collection of discrete, unordered values from a finite domain, for example, a set of batch sizes or a set of activation functions.
  4. Binary or boolean type: hyperparameters that take true or false, or a value of zero or one, for example, a flag to enable or disable a certain feature.

Conditional Hyperparameter

The framework needs to support the creation of search spaces for parameters that depend on each other. The following types of dependencies need support.

  1. A hyperparameter should only be active when another hyperparameter is active.
  2. Enabling or disabling two hyperparameters in a mutually exclusive manner.
  3. Enabling or disabling multiple hyperparameters depending on the value of a particular hyperparameter.

Key Components of Clara AutoML Framework

The components of Clara AutoML are an engine, a controller, a scheduler, an executor, and handlers. The high-level architecture is shown in the figure below.

High Level Architecture of Nvidia Clara AutoML

The engine is responsible for all coordination between the controller, scheduler, and executor. The engine also invokes the handler hooks when events fire during runtime. The role of the executor is to execute the jobs scheduled by the scheduler. The controller implements the search algorithm and generates recommendations over the search space; some algorithms try to reduce the search space by adopting heuristic pruning. The search is performed in small batches of jobs, called recommendations, generated by the controller. Even for grid search, which is essentially a brute-force search, the controller tries to stop job scheduling early if the desired result is achieved.

Reinforcement Controller

The reinforcement controller is based on reinforcement learning⁹. It separates the search space into an enum subspace and a float subspace. It uses reinforcement learning to generate recommendations for the float subspace, pairs those recommendations with the enum subspace to create the final recommendations, and passes them to the scheduler.
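As a rough illustration of this pairing step, the sketch below combines every enum-subspace combination with a float proposal. All names are hypothetical, and uniform sampling stands in for the actual reinforcement learning policy:

```python
# Hypothetical sketch of pairing float proposals with the enum subspace;
# names and logic are illustrative, not Clara's actual implementation.
import itertools
import random

# Enum subspace: every combination of the enumerated choices.
enum_space = {"batch_size": [4, 8], "activation": ["relu", "leaky_relu"]}
enum_combos = [dict(zip(enum_space, vals))
               for vals in itertools.product(*enum_space.values())]

# Float subspace: a proposal an RL policy would generate; here plain
# uniform sampling over the allowed range stands in for the policy.
float_space = {"learning_rate": (0.0001, 0.001)}
float_proposal = {k: random.uniform(lo, hi) for k, (lo, hi) in float_space.items()}

# Final recommendations: each enum combination paired with the proposal.
recommendations = [{**combo, **float_proposal} for combo in enum_combos]
for rec in recommendations:
    print(rec)
```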

Bring Your Own Components and Clara AutoML

Keeping with the philosophy of "convention over configuration," Clara provides controllers, executors, schedulers, and handlers for the AutoML module via configuration. At the same time, all these dependencies can also be developed by the user, with DI achieved through "config_automl.json". The components developed by the user are called BYOC. In the case of AutoML, a completely custom AutoML system is possible, which helps in implementing hyperparameter optimization algorithms not provided out of the box by Clara.

Clara AutoML Workflow

When the AutoML module is launched, it starts three kinds of threads. The engine and the controller share a thread, and the scheduler runs in a second thread; these two run throughout the whole execution. The third kind of thread executes a job run, and depending on the available resources, the scheduler may launch multiple such threads.

The execution of this module starts with the executor creating the search space based on the provided configuration. This search space is handed over to the controller to generate recommendations. These recommendations are passed on to the scheduler, which launches a job for each recommendation. The results are passed back to the controller to fine-tune the next set of recommendations. This forms the outer loop mentioned in the section "What is the problem? Why AutoML?". The loop keeps running until all the recommendations are exhausted, and it can also be stopped early if the desired score is achieved. Each job runs the inner training loop mentioned above.
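The following is a schematic, single-threaded sketch of this workflow. The class and method names are assumptions, and the real module is multi-threaded and event-driven:

```python
# Schematic sketch of the AutoML outer loop; object and method names are
# illustrative assumptions, not the Clara Train API.

def automl_outer_loop(executor, controller, scheduler, target_score=None):
    # The executor builds the search space from the configuration.
    search_space = executor.build_search_space("config_train.json")
    controller.set_search_space(search_space)

    while True:
        # The controller proposes a small batch of jobs (recommendations).
        recommendations = controller.recommend()
        if not recommendations:
            break  # search space exhausted

        # The scheduler launches one training job (the inner loop) per
        # recommendation and collects the resulting validation scores.
        results = scheduler.run_jobs(recommendations)

        # Results flow back to the controller to refine the next batch.
        controller.update(results)

        # Optional early stop once the desired score is achieved.
        if target_score is not None and max(results.values()) >= target_score:
            break
```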

Configuring the search space in "config_train.json"

The different types of hyperparameters described in the section "Types of Hyperparameters" can be configured in "config_train.json". To configure a search space, the user needs to provide four parameters.

  1. "domain": the domain of the parameter. The supported domains are learning rate ("lr"), network ("net"), and transforms ("transform").
  2. the type of the hyperparameter
  3. the argument or attribute
  4. finally, the targets: the ranges of the actual values to search

In the figure below, the argument "use_amp" has a boolean search space, which can take the value "true" or "false".

Boolean Hyperparameter Example
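A hedged reconstruction of such a search entry, shown as a Python dict mirroring the JSON; the key names follow the four parameters described above, but the exact schema is an assumption to verify against the Clara Train documentation:

```python
# Hedged sketch of a boolean search entry (key names assumed from the
# four parameters described above).
bool_search_entry = {
    "domain": "net",           # assumed domain for this argument
    "type": "bool",            # boolean hyperparameter
    "args": ["use_amp"],       # the argument being searched
    "targets": [True, False],  # the two values to try
}
```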

The figure below shows an example of configuring a search space for a float value. For float values, we need to specify the range with a minimum and a maximum value. In this example, "learning_rate" takes values between 0.0001 and 0.001. Relevant portions are highlighted in the figures.

Float Hyperparameter Example
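Under the same assumed schema, a float search entry would specify the continuous range as its targets:

```python
# Hedged sketch of a float search entry: a continuous range given by
# its minimum and maximum values (schema assumed as above).
float_search_entry = {
    "domain": "lr",
    "type": "float",
    "args": ["learning_rate"],
    "targets": [0.0001, 0.001],  # [min, max] of the search range
}
```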

The figure below shows an example of configuring a search space for a list of floating-point values. In this example, "poly_power" takes the values 0.9 and 0.99. All the above examples configure a single hyperparameter.

Float Enum Hyperparameter Example
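Again under the same assumed schema, an enum entry lists the discrete candidate values:

```python
# Hedged sketch of an enum search entry: a finite, unordered set of
# candidate values for "poly_power" (schema assumed as above).
enum_search_entry = {
    "domain": "lr",
    "type": "enum",
    "args": ["poly_power"],
    "targets": [[0.9], [0.99]],  # each inner list is one candidate assignment
}
```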

In the figure below, a conditional hyperparameter configuration is shown. In this example, either "mySearchLoss1" or "mySearchLoss2" is enabled; they are mutually exclusive.

Conditional Hyperparameter, Enable Mutually Exclusive Example
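One plausible way to express such mutual exclusion, sketched under the same assumed schema, is an enum over the component names so that exactly one loss is enabled per job; the mechanism shown is an illustrative assumption, not Clara's exact configuration:

```python
# Hedged sketch of a mutually exclusive (conditional) search entry:
# each job enables exactly one of the two loss components.
conditional_search_entry = {
    "domain": "net",
    "type": "enum",
    "args": ["loss"],
    "targets": [["mySearchLoss1"], ["mySearchLoss2"]],
}
```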

Integrating with MLFlow

Clara AutoML generates multiple jobs. The output of all the runs is organized in the MMAR under the directory automl, but Clara does not offer a clean way to organize all the job runs for analysis.

MLFlow

To organize all the AutoML jobs, Nvidia Clara is integrated with MLFlow. This integration is achieved using a custom handler.

Handlers in Clara AutoML

Handlers are based on the event-driven programming paradigm. The code in a handler fires when specific events occur during the AutoML runtime. Clara AutoML supports the following events.

  1. recommendations_available — This is fired when recommendations are available.
  2. startup — This is fired at the start of the AutoML module.
  3. shutdown — This is fired when there are no more recommendations available.
  4. start_job — This is fired at the start of a job run.
  5. round_ended — This is fired when one round of recommendations is completed.
  6. end_job — This is fired at the end of a job run.

MLFlow Integration

The strategy for integrating Clara and MLFlow is to use two events: (1) the "startup" event and (2) the "end_job" event. At the start of the module, the experiment is set up, as shown in the figure labeled "MLFlow integration code in the startup event." At the end of each job, the run is added to the experiment created in the startup event, as shown in the figure labeled "MLFlow Integration code in the end_job event."

MLFlow integration code in the startup event
MLFlow Integration code in the end_job event
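A minimal sketch of such a handler follows; the handler base shape, hook signatures, context attributes, tracking URI, and experiment name are assumptions, while the mlflow calls (set_tracking_uri, set_experiment, start_run, log_param, log_metric) are the standard MLflow tracking API:

```python
# Minimal sketch of an MLFlow-logging handler for Clara AutoML.
# Hook signatures and the `ctx` attributes are illustrative assumptions;
# the mlflow calls are the standard MLflow tracking API.
import mlflow


class MLFlowHandler:
    """Logs each AutoML job as an MLflow run under one experiment."""

    def startup(self, ctx):
        # Fired once when the AutoML module starts: set up the experiment.
        mlflow.set_tracking_uri("http://localhost:5000")  # assumed server
        mlflow.set_experiment("clara-automl-search")      # assumed name

    def end_job(self, ctx):
        # Fired at the end of each job: record its recommendation and score.
        # `ctx` is assumed to expose the job's hyperparameters and outcome.
        with mlflow.start_run(run_name=ctx.job_name):
            for name, value in ctx.recommendation.items():
                mlflow.log_param(name, value)
            mlflow.log_metric("val_score", ctx.score)
```

The handler would then be registered in "config_automl.json" so that the engine invokes these hooks when the corresponding events fire.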

Additional Resources

  1. A detailed document is available here.
  2. Clara Train examples are available here.

References

[1] James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13:281–305, 2012.

[3] Frank Hutter and Joaquin Vanschoren. Automatic Machine Learning (AutoML): A Tutorial.

[5] Gustavo Pabon and Mario Leyton. Tackling Algorithmic Skeleton's Inversion of Control. In 2012 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pages 42–46, Munich, Germany, February 2012. IEEE. ISBN 978-1-4673-0226-5. doi:10.1109/PDP.2012.86.

[8] Powering AutoML-enabled AI Model Training with Clara Train. https://developer.nvidia.com/blog/powering-automl-enabled-ai-model-training-with-clara-train/, April 2020.

[9] Dong Yang, Holger Roth, Ziyue Xu, Fausto Milletari, Ling Zhang, and Daguang Xu. Searching Learning Strategy with Reinforcement Learning for 3D Medical Image Segmentation. arXiv:2006.05847 [cs], June 2020.
