From PyTorch to Mosaic: An Overview and First Impressions

Figure 1 Hallelujah - we can finally simplify our LLM training loop with Mosaic Composer

Our Aimpoint Digital Labs team has been training large language models from scratch for a while now and has learned a few lessons the hard way. Scaling models to large sizes can become almost more of an engineering challenge than an AI one. Our codebase was originally built by combining several repositories, including (but not limited to) the excellent OLMo codebase. As it turns out, the OLMo models were trained with help from the MosaicML team, which has also released some nifty code packages of its own. In our constant search for better training efficiency and ease of use, we recently had the opportunity to try out some of these Mosaic offerings. When we first set out to integrate packages like Composer into our PyTorch codebase, we had questions: How does Composer compare to other popular frameworks like PyTorch Lightning? What do integrations with frameworks like DeepSpeed and FSDP/DDP look like? What are the pros and cons of using the Mosaic stack? We also did not know how much effort would be required to convert a custom transformer codebase from PyTorch to Composer and ultimately submit training runs using MCLI.

Over a series of three blogs, we hope to provide a technical reference on how some of these popular frameworks compare, how to migrate a PyTorch codebase to Composer, and why you would want to do that in the first place. The series assumes a working knowledge of AI and good familiarity with Python frameworks like PyTorch. We will also cover some technical concepts specific to Transformers; if you need a quick primer, we recommend the PyTorch documentation as a guide. This first blog covers an overview of Composer. Be sure to check out our follow-up blogs, where we dive deeper into unlocking Composer and migrating.

What is Composer?

Figure 2 Not this kind of composer

Mosaic Composer is a free, open-source framework that streamlines the training of advanced AI systems by making it easy to integrate both prebuilt and custom algorithms into your training loop.

Composer can help speed up and simplify training workflows for models such as transformers, diffusion models, CNNs, and more. We will get into more specific details later, but some of the main benefits of using Composer to train your models are listed below (a minimal usage sketch follows the list):

  • Custom speedup algorithms
    • Composer features a variety of specially developed algorithms that can be easily inserted into a training workflow using callbacks
  • OOM prevention
    • Automatic right-sizing of microbatch sizes based on available GPU VRAM to help prevent out-of-memory (OOM) errors
  • Checkpoint resumption
    • Automatically resume from your most recent checkpoint after any training failure simply by rerunning your training loop
  • Easy logging
    • Stream logs and checkpoints to cloud storage and other easily accessible destinations
  • Elastic checkpoint sharding
    • Resume training on GPU clusters of varying sizes without heavy config file edits
  • Direct integration with cloud data storage
    • Stream data from inexpensive cloud storage using the MDS data format
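
To make these features concrete, here is a minimal sketch of what a Composer training loop can look like. The model, dataset, algorithm choices, and hyperparameters below are illustrative assumptions rather than a recommended recipe, so check the Composer documentation for the exact API of your installed version:

```python
import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from composer import Trainer
from composer.algorithms import BlurPool, LabelSmoothing
from composer.loggers import FileLogger
from composer.models import ComposerClassifier

# Any image-classification dataset works here; CIFAR-10 is only an example.
train_dataset = datasets.CIFAR10(
    "data/", train=True, download=True, transform=transforms.ToTensor()
)
train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True)

# Wrap a plain torch.nn.Module so Composer knows how to compute loss and metrics.
model = ComposerClassifier(
    module=torchvision.models.resnet18(num_classes=10), num_classes=10
)

trainer = Trainer(
    run_name="composer-demo",                      # a fixed run name lets autoresume find its checkpoints
    model=model,
    train_dataloader=train_dataloader,
    optimizers=torch.optim.AdamW(model.parameters(), lr=1e-3),
    max_duration="2ep",                            # train for two epochs
    algorithms=[BlurPool(), LabelSmoothing(0.1)],  # plug-in speedup/quality algorithms
    device_train_microbatch_size="auto",           # auto-size microbatches to help avoid OOMs
    save_folder="checkpoints",                     # local path or cloud URI (e.g. s3://...)
    save_interval="1ep",
    autoresume=True,                               # rerun the script to resume from the latest checkpoint
    loggers=[FileLogger()],
)
trainer.fit()
```

The same `Trainer` call touches several of the bullets above: the `algorithms` list injects speedup methods, `device_train_microbatch_size="auto"` handles OOM prevention, and `save_folder` plus `autoresume` cover checkpoint resumption.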

You can get up and running with Composer by following the helpful QuickStart guides in the documentation. In the next section we will get into more specific details on some of the above-mentioned features and how to integrate them into your codebase. 

What are some key features of Composer and Mosaic MCLI?

Before getting into more specific details about the Composer package, let's first review the MCLI terminal integration, which we used during some recent model runs and which integrates easily with the open-source Composer package.

MCLI

MCLI is the command-line interface used to submit pretraining and finetuning runs to Mosaic's cloud compute. Once a compute cluster is attached, you can set up your account and log in to the simple-to-use GUI:

Figure 3 Mosaic ML GUI

The GUI lets you see the compute clusters attached to your organization for advanced training runs. Note that, following Mosaic's recent partnership with Databricks, some previously available features (inference and finetuning) have been integrated more directly into the Databricks platform.

You can set up MCLI in just a couple of terminal commands.
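As a rough sketch (exact commands and prompts may differ across MCLI versions, so confirm against the MCLI docs), installation and authentication look something like:

```bash
# Install the MosaicML CLI and authenticate against your organization's account.
pip install mosaicml-cli
mcli init              # interactive setup that stores your API key locally
mcli get clusters      # verify which compute clusters are attached
```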

We will go step by step through a complete run submission with code examples in a subsequent blog, but we recommend reviewing the MCLI documentation as well. MCLI provides an environment in which you don't have to manage compute yourself, so you can focus on AI engineering rather than environment setup. You can easily submit a training run using a YAML file:
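The run configuration below is a hedged sketch of a minimal "hello world" MCLI YAML; the field names and cluster value are assumptions that can vary by account and MCLI version:

```yaml
# hello_world.yaml -- illustrative minimal MCLI run config.
name: hello-world
image: python:3.10        # any Docker image accessible to your cluster
compute:
  cluster: my-cluster     # placeholder; list yours with `mcli get clusters`
  gpus: 0                 # a hello-world echo needs no GPUs
command: |
  echo "Hello World!"
```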

And then launch it from the terminal:
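Assuming the file above is saved as hello_world.yaml, submission looks roughly like this (flags may vary by MCLI version):

```bash
# Submit the run; `mcli logs <run-name>` can then stream its output.
mcli run -f hello_world.yaml
```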

You'll see run progress in your terminal (note that the screenshot below is from an actual training run; a "hello world" echo typically requires far less compute :D):

Next, we'll dive into a comprehensive guide to unlocking the core benefits of Mosaic Composer.

Author: Aaron McClendon, Head of AI
