Unlocking Mosaic Composer: Features, Benefits, and Integrations


If you missed our overview of Mosaic Composer, be sure to check that out before diving deeper. Let’s now take a more detailed look at some of the features offered by the Composer open-source package. Note that this package integrates directly with MCLI, and can also be used on its own in whatever compute environment you have available.

Mosaic Data Shards (MDS)

Note that MDS isn’t technically part of the Composer package, but it integrates closely with Composer and we will be using it extensively here, so we cover it first.

MDS is a data format that can be used to stream datasets from local or remote (cloud) storage into your compute environment. You can convert your data into MDS format with a simple loop like the following:

Figure 4 Taken from  https://github.com/mosaicml/streaming
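
The original snippet here was rendered as an image; the sketch below follows the conversion pattern from the streaming repository’s README, with illustrative column names and randomly generated samples:

```python
import numpy as np
from PIL import Image
from streaming import MDSWriter

# Directory (local path or object-store URI) where the shards will be written
out_dir = 'path-to-dataset'

# Map each sample field to its encoding
columns = {'image': 'jpeg', 'class': 'int'}

# Write samples into compressed MDS shards
with MDSWriter(out=out_dir, columns=columns, compression='zstd') as out:
    for _ in range(10000):
        sample = {
            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
            'class': np.random.randint(10),
        }
        out.write(sample)
```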

MDS provides an alternative to PyTorch DataLoaders, with optimized I/O operations for high-throughput sampling. Similar to PyTorch Datasets, you can access samples by index, as simply as dataset[i], even if the sample hasn’t been fully downloaded yet.

MDS also supports arbitrary data types, making it versatile for deep learning applications, whereas PyTorch Datasets require some additional configuration for multimodal data. Some of the key benefits of using MDS compared to PyTorch Datasets or plain NumPy memmap datasets are listed below, followed by a short usage sketch:

  • True streaming support and seamless cloud integration
  • Random access
    • Get samples from your dataset with easy indexing, even if the sample hasn’t been locally downloaded yet.
  • No divisibility requirements
    • MDS removes the requirement to make datasets divisible by the number of devices, automatically adjusting sample selection to ensure each device processes the same number of samples per epoch. This avoids the need to pad or drop samples, enhancing flexibility compared to traditional PyTorch datasets.
  • Disk space management
    • Set a limit on how much downloaded data to keep on local disk before evicting shards that have already been used.
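
As a quick illustration of the streaming and random-access behavior described above, here is a minimal sketch using StreamingDataset from the same streaming package; the bucket path and cache directory are placeholders:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Stream shards from remote storage, caching them locally as needed
dataset = StreamingDataset(
    remote='s3://my-bucket/my-mds-dataset',  # placeholder remote path
    local='/tmp/mds-cache',                  # local shard cache
    shuffle=True,
)

# Random access works even before the backing shard has been downloaded
sample = dataset[42]

# Wrap in a standard PyTorch DataLoader for training
loader = DataLoader(dataset, batch_size=32)
```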

We will show how to integrate MDS with Composer in later sections. 

Training Loops

Mosaic Composer can dramatically simplify training loops with its flexible Trainer class. We can speak from experience: previously, with base PyTorch, our custom LLM training loop took several hundred lines of code. After migrating to Mosaic Composer, our “loop” is now just a few lines of code, yet maintains all the same functionality. In this section we will cover some of the key features of the Trainer class. This is intended to be a high-level summary; for detailed information we recommend reading the well-written documentation.

When using the Mosaic Trainer, you need to pass it an instance of a Composer model. This can be a prebuilt architecture, or a custom model you have wrapped with the ComposerModel class. Later in this series we will give an example with a heavily customized model. After doing this, you can train the model as simply as:
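
The original snippet was an image; a minimal sketch of the idea, assuming model is a ComposerModel instance and train_dataloader is a standard PyTorch DataLoader, might look like this:

```python
from composer import Trainer

# Minimal training run; `model` and `train_dataloader` are defined elsewhere
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',  # train for ten epochs
    device='gpu',
)
trainer.fit()
```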

The Trainer class is quite powerful and provides a lot of functionality out of the box. By contrast, our original training loop relied on many custom-defined functions:

Figure 5 Small snippet of our original training loop using PyTorch

The Trainer class performs both a training and an evaluation loop, and can operate in distributed environments directly, like the native PyTorch offerings. There is currently support for popular distributed training approaches such as DeepSpeed and FSDP. You can also easily save model checkpoints to cloud storage simply by passing a path to the save_folder argument.
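
For example, a minimal sketch of checkpointing to object storage could look like the following; the bucket path and save interval are placeholders:

```python
from composer import Trainer

# Checkpoints are written to save_folder (a local path or object-store URI)
# at each save_interval
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    save_folder='s3://my-bucket/checkpoints',  # placeholder bucket path
    save_interval='500ba',                     # checkpoint every 500 batches
)
trainer.fit()
```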

Speed-Up Algorithms

One of the features that inspired the creation of the original Mosaic Composer library was the desire to integrate custom speed-up algorithms. When training more advanced models, one often wants to insert customized techniques from newly published research papers, AI groups, open-source libraries, and more.

As an example, in our LLM training codebase we integrated a novel method from a Google research group, Infini-attention (arXiv:2404.07143). The method proposes augmenting the standard attention mechanism to allow for effectively infinite context windows, similar in spirit to how a State Space Model works. We won’t go into the technical details of the method here. However, inserting this custom technique into our model architecture required careful planning to avoid contaminating the larger codebase and to ensure smooth integration with the rest of the classes being used.

Additional pain can come when attempting to use multiple custom algorithms. It is possible to do this in PyTorch and PyTorch Lightning, but it can get clunky quickly. With Composer, integrating custom algorithms is straightforward:

Figure 6 A quick example from the Mosaic documentation for custom algorithm integration
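
To give a flavor of what this looks like, here is a minimal sketch of passing a few of Composer’s built-in speed-up algorithms to the Trainer; the specific algorithms chosen are just examples, not the ones from the figure:

```python
from composer import Trainer
from composer.algorithms import BlurPool, ChannelsLast, LabelSmoothing

# Speed-up algorithms are passed to the Trainer as a simple list
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='10ep',
    algorithms=[
        BlurPool(),
        ChannelsLast(),
        LabelSmoothing(smoothing=0.1),
    ],
)
trainer.fit()
```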

Large codebases can quickly become unwieldy when many people are writing many custom algorithms; Composer aims to solve this issue through Events and State. Many Events occur throughout a training loop, including fetching the data, the forward pass, the loss calculation, the backward pass, and more. Composer allows us to inherit from the Algorithm class and hook into specific Events in our training loop. This is a key benefit of Composer: the ability to easily hook custom algorithmic augmentations into our existing training loop. Within a custom algorithm we specify the Event we want to hook into, and then simply pass that algorithm to our Trainer.

Figure 7 Example of a custom speed up algorithm

The example shown above is taken directly from the Mosaic documentation and shows how one can easily integrate a custom algorithm (in this case, one that drops certain pieces of information from a data sample). The custom algorithm hooks into the Event.AFTER_DATALOADER event and therefore post-processes data samples returned from the data loaders.
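
Since the original listing was an image, here is a rough sketch of what such an algorithm can look like; the class name and the dropped field are illustrative rather than the exact code from the figure:

```python
from composer.core import Algorithm, Event

class DropUnusedField(Algorithm):
    """Illustrative algorithm that removes one field from each batch."""

    def match(self, event, state):
        # Only run right after the dataloader yields a new batch
        return event == Event.AFTER_DATALOADER

    def apply(self, event, state, logger):
        # state.batch holds the most recent batch; here we assume it is a
        # dict and drop a hypothetical key before the forward pass
        state.batch.pop('unused_field', None)
```

The algorithm instance would then be passed to the Trainer through its algorithms argument, e.g. algorithms=[DropUnusedField()].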

Composer Summary

Composer offers many other benefits as well, including direct integration with Hugging Face, auto-microbatching (you can avoid those annoying CUDA out-of-memory errors simply by changing an argument of the Trainer class to automatically right-size your microbatches per device), early stopping, and more! For details and advanced tutorials, again, we suggest you read through the well-written documentation.
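
As a reference point, a minimal sketch of enabling auto-microbatching in recent Composer versions, reusing the model and dataloader assumed earlier:

```python
from composer import Trainer

# 'auto' tells Composer to find the largest microbatch size that fits on
# each device, retrying automatically on CUDA out-of-memory errors
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='1ep',
    device_train_microbatch_size='auto',
)
```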

Next up in our Mosaic Composer blog series, we are going to dive deep into migrating to Composer.

Author
Aaron McClendon
Head of AI